Netbox

Netbox is used by Wikimedia as a tool for data center infrastructure management (DCIM) and IP address management (IPAM). It also serves as an integration point for switch and port management, DNS management, and other network operations.

Web UI

https://netbox.wikimedia.org/
Login using your Developer account/LDAP/Wikitech credentials (Hiera profile::netbox::cas_group_attribute_mapping, *NOTE*: To login you must be a member of either the wmf or nda group as membership in those sets is_active=True)
- nda group members have a partial read-only access - task T302870
- wmf group members have a full read-only access
- ops group members have full read-write access
- To request membership in any of those groups, see SRE/LDAP/Groups
Content-Security-Policy headers are set - task T296356

API

Endpoint
- From an internal host (eg. everything BUT your laptop or WMCS), you MUST use netbox.discovery.wmnet
REST API
- Create a token on https://netbox.wikimedia.org/user/api-tokens/, ideally read-only, ideally with an expiration date
- Doc: https://netbox.wikimedia.org/api/docs/
- Python library: https://github.com/netbox-community/pynetbox/
- Note that the REST API is quite slow, make sure to optimize your queries and use pynetbox threading
GraphQL
- https://netbox.wikimedia.org/graphql/
Spicerack
- See https://doc.wikimedia.org/spicerack/master/api/spicerack.netbox.html for the Netbox support
- It's preferred to use the built in wrapper functions and not the pynetbox interface directly as your cookbook might break if they're not updated when Netbox introduces breaking changes

Staging

It consists of a single bookworm VM (netbox-dev2003) combining frontend, redis and database
Reachable on netbox-next.wikimedia.org and netbox-next.discovery.wmnet
- Behind caches, similarly to the prod infrastructure
It's data comes from a manual dump of production's database
- Reach out to Infrastructure Foundations if you need a more fresh database
- Be careful not to leak any of its data
It is used to test Netbox upgrades, scripts, reports, etc
This host is active in monitoring (with notifications disabled)
- As such, make sure that all the alerts cleared after your tests

Production infrastructure

the production Netbox infrastructure consists of 4 bullseye VMs (see all Netbox VMs):

2 active/passive frontends (netboxXXXX)
- Using the central Redis
2 primary/replica postgresSQL databases (netboxdbXXXX)

By default the active/primary servers are the eqiad ones.

The public endpoint is behind our CDN so the request flow is:

CDN - (using the wildcard *.wikimedia.org as its TLS certificate)
active frontends
1. Apache (using cfssl for its TLS certificate)
2. Django app (through uwsgi)
Active database

Monitoring

Icinga

See all Netbox related Icinga checks: https://icinga.wikimedia.org/cgi-bin/icinga/status.cgi?search_string=netbox

In addition to the regular set of VM checks that are run on all servers, there are Icinga checks that only run on the active servers.

Controlled by the profile::netbox::db::primary and profile::netbox::active_server Hiera keys.

Frontends

Controlled by the profile::netbox::active_server Hiera key:

Alerting for the Ganeti sync systemd timers (are they running correctly?) - see also Netbox#Ganeti sync
Alerting for the Netbox reports (is there invalid data in Netbox?) - see also Netbox#Reports
Alerting for the DNS export automation (are there Uncommitted DNS changes in Netbox?) - See also Netbox#DNS

Databases

The replica have a check for replication delay.

Prometheus

Setup task: https://phabricator.wikimedia.org/T243928

Global health overview (beta): https://grafana.wikimedia.org/d/DvXT6LCnk/

Logstash

https://logstash.wikimedia.org/app/dashboards#/view/AXB84iDRKWrIH1QRIR_j

Failover

Frontends

Using confctl, pool the passive server and depool the previous active one.

confctl --object-type discovery select 'dnsdisc=netbox,name=codfw' set/pooled=true
confctl --object-type discovery select 'dnsdisc=netbox,name=eqiad' set/pooled=false

If the failover is going to last (eg. longer than a server reboot), change the profile::netbox::active_server Hiera key to the backup server. This will ensure the cron/systemd timers as well as Icinga checks are running.

Note that having the active frontend in a different datacenter than the primary database will result in Netbox being slower.

Databases

If the primary database server needs a short downtime it's recommended to not try a failover and instead have Netbox offline for a short amount of time.

There are currently no documented procedure on how to fail the database over, and even less how to fail back to the former primary.

Database

Restore

First of all analyze the Netbox changelog to choose what's the best action to perform a restore.

The general options are:

Manually (or via the API) re-play the actions listed in the changelog in reverse order. The changelog entries don't have full raw data, some of them might show the names instead of the IDs required in the API
Restore a database dump. This ensure consistency at a given point in time, and could even be used to perform some partial restore using pg_restore.

To restore files from Bacula back to the client, use bconsole on helium and refer to Bacula#Restore_(aka_Panic_mode) for detailed steps.

PostgresSQL

Dumps backups

On the database servers, a Puppetized systemd timer (class postgresql::backup) automatically creates a daily dumps file of all local Postgres databases (pg_dumpall) and stores it in /srv/postgres-backup

On primary node: daily dumps
On secondary node: hourly dumps

This path is then backed up by Bacula#Adding a new client

For more details, the related subtask to setup backups was Phab:T190184, improved in task T262677

Stop the Netbox services before restore

Postgres may prevent us from deleting the current database before the restore if there are active remote connections to it. This is usually not the case but it has been observed on occasion. To minimize the chance of it happening we should stop the Netbox services on primary netbox host (as of June 2022 netbox1002) prior to restoring the db.

First downtime the active netbox server with the downtime cookbook from a cumin host:

sudo cookbook sre.hosts.downtime --minutes 30 -r "Restoring DB from backup on netboxdb1002" -t <task> netbox1002.eqiad.wmnet

Then on the active netbox host itself:

sudo systemctl stop rq-netbox
sudo systemctl stop uwsgi-netbox.service

Restore the DB dump

NOTE: The below instructions are valid to copy live db dump to 'netbox-next', the postgress db runs locally on that host rather than connecting to a dedicated DB VM. Also see note below about running puppet afterwards to change the netbox db user password (which is different on the dev host so should be changed if we dump the db from live one).

Check the dump files on the secondary DB host (as of Dec. 2022 netboxdb2003) in /srv/postgres-backup, if any issue with those files, do the same on the primary host. The secondary host performs hourly backups while the primary only daily.
If the secondary host has a more newer backup:
- Copy the dump to the primary DB host (as of Dec. 2022 netboxdb1003), you can use from one of the cumin hosts (cumin1002.eqiad.wmnet, cumin2002.codfw.wmnet) as root, run:
```
SSH_AUTH_SOCK=/run/keyholder/proxy.sock scp -3 root@netboxdb2003.codfw.wmnet:/srv/postgres-backup/psql-all-dbs-latest.sql.gz root@netboxdb1003.eqiad.wmnet:/srv/
```
- SSH into the primary DB host (as of Nov. 2024 netboxdb1003)
- Change the permissions of the copied backup to be owned by postgres:postgres
Take a one-off backup on the primary DB host (as of Dec. 2022 netboxdb1003) right before starting the restore with (the .bak suffix is important to not be auto-evicted):
```
$ su - postgres
$ /usr/bin/pg_dumpall | /bin/gzip > /srv/postgres-backup/${USER}-DESCRIPTION.psql-all-dbs-$(date +\%Y\%m\%d).sql.gz.bak
```

Become postgres user:
```
sudo -i -u postgres
```
Connect to the DB, list and drop the Netbox database:

$ psql
postgres=# \l
...
postgres=# DROP DATABASE netbox;
DROP DATABASE
postgres=#

# NOTE - you may still get a message saying 'database "netbox" is being accessed by other users' which prevents you dropping the db.  These can be from the active netbox host, running reports triggered by systemd timers, and the backup db host.  It is probably easiet to wait until these complete and re-try, if it cannot wait probably the services/processes connecting from those remote hosts need to be stopped.  As a last resort 'DROP DATABASE "netbox" WITH(FORCE);' can be used.

Still as the postgres user, restore the DB with:

$ gunzip < /srv/psql-all-dbs-SOME_DATE.sql.gz | /usr/bin/psql

DEV Host password

If the dump has been restored to the DEV host hosting netbox-next, run puppet to fix the netbox DB user password at this time. NOTE: It has been noticed recently (Nov 2024) that puppet is not adjusting the password in some cases. If there are logs such as 'password authentication failed for user "netbox"' you can manually change the netbox user password (password is available in /etc/netbox/configuration.py):

sudo -i -u postgres
psql netbox
ALTER USER netbox WITH PASSWORD '<password>';

Start Netbox services after a restore

After the DB has been restored we can restart Netbox. If the restore was on the nebox-next host first run puppet to fix the DB password. SSH into the Netbox active host (as of June 2022 netbox1003) and execute:

sudo systemctl restart uwsgi-netbox.service
sudo systemctl restart rq-netbox.service
sudo systemctl status uwsgi-netbox.service
sudo systemctl status rq-netbox.service

Then check the logs in /srv/log/netbox/main.log and that netbox.wikimedia.org works properly. Check also the last item in the Netbox changelog section in the UI to ensure the data is correctly loaded.

Sanitizing a database dump

The Netbox database contains a few bits of sensitive information, and if it is going to be used for testing purposes in WMCS it should be sanitized first.

Create a copy of the main database createdb netbox-sanitize && pg_dump netbox | psql netbox-sanitize
Run the below SQL code on netbox-sanitize database.
Dump and drop database pg_dump netbox-sanitize > netbox-sanitized.sql; dropdb netbox-sanitize

THE BELOW COMMANDS ARE OUTDATED AND MIGHT NOT COVER EVERYTHING THAT NEEDS TO BE SANITIZED

-- truncate secrets
TRUNCATE secrets_secret CASCADE;
TRUNCATE secrets_sessionkey CASCADE;
TRUNCATE secrets_userkey CASCADE;

-- sanitize dcim_serial
UPDATE dcim_device SET serial = concat('SERIAL', id::TEXT);

-- truncate user table
TRUNCATE auth_user CASCADE;

-- sanitize dcim_interface.mac_address
UPDATE dcim_interface SET mac_address = CONCAT(
                   LPAD(TO_HEX(FLOOR(random() * 255 + 1) :: INT)::TEXT, 2, '0'), ':',
                   LPAD(TO_HEX(FLOOR(random() * 255 + 1) :: INT)::TEXT, 2, '0'), ':',
                   LPAD(TO_HEX(FLOOR(random() * 255 + 1) :: INT)::TEXT, 2, '0'), ':',
                   LPAD(TO_HEX(FLOOR(random() * 255 + 1) :: INT)::TEXT, 2, '0'), ':',
                   LPAD(TO_HEX(FLOOR(random() * 255 + 1) :: INT)::TEXT, 2, '0'), ':',
                   LPAD(TO_HEX(FLOOR(random() * 255 + 1) :: INT)::TEXT, 2, '0')) :: macaddr;

-- sanitize cricuits_circuit.cid
UPDATE circuits_circuit SET cid = concat('CIRCUIT', id::TEXT);

Netbox Extras

CustomScripts, reports (merged with scripts in the UI), validators and other associated tools for Netbox are collected in the netbox-extras repository at https://gerrit.wikimedia.org/r/plugins/gitiles/operations/software/netbox-extras/.

as a safeguard, after merging your change, its deployment is not fully automatic.

TL;DR; after merging your change use the sre.netbox.update-extras cookbook.

For Netbox DEV: sudo cookbook sre.netbox.update-extras --reason 'a good reason' -a netbox-canary
For Netbox PROD: sudo cookbook sre.netbox.update-extras --reason 'a good reason' -a netbox

If files in the "validators" directory are changed, the cookbook will also restart uwsgi-netbox

More details

Since Netbox 4, the repository needs to be deployed in 2 different locations (both on the frontends):

Through the local checkout of the git repository under /srv/deployment/netbox-extras, for validators, tools (Ganeti sync, dns sync) and shared common.py scripts code.
Through Netbox 's DataSource module, for scripts and reports, which ultimately copies them to /srv/netbox/customscripts/ after going through its internal DB.

Syncing the datasource by clicking on the Netbox's UI button only sync is on the primary Frontend.

Scripts and reports are dispatched between all frontends when ran. Make sure "netbox-extra" is in sync between all nodes.

Netbox (and source of truth) principles

Data automatically synced from the infrastructure should not drive the infrastructure
- It is to be used for information purposes (eg. VM disk space) or as support for original data (eg. server interfaces for cables/IP/dns_name)
All data manually entered will have entry mistakes
- Use helper scripts, input validation or post entry consistency checks (reports)
All data manually entered will go stale
- Refrain from adding data that will not drive the infrastructure

Netbox features

Custom Links

WebUI (defined there): https://netbox.wikimedia.org/extras/custom-links/

Doc: https://docs.netbox.dev/en/stable/models/extras/customlink/

Netbox allow to setup custom links to other websites using Jinja2 templating for both the visualized name and the actual link, allowing for quite some flexibility. The current setup (as of August 2024) has the following links:

Grafana (for all physical devices and VMs)
Icinga (for all physical devices and VMs)
AlertManager (for all physical devices and VMs)
Debmonitor (for all physical devices and VMs)
Procurement Ticket (only for physical devices that have a ticket that matches either Phabricator or RT)
Hardware config (for Dell and HP physical devices, pointing to the manufacturer page for warranty information based on their serial number)
LibreNMS (for Juniper, opengear and sentry devices)
Puppetboard (for all physical devices and VMs)

Reports

Reports are deprecated beginning with NetBox v4.0, and their functionality has been merged with custom scripts. While backward compatibility has been maintained, users are advised to convert legacy reports into custom scripts soon, as support for legacy reports will be removed in a future release.

WebUI (reports results): https://netbox.wikimedia.org/extras/scripts/

Doc: https://netboxlabs.com/docs/netbox/en/stable/customization/reports/

Defined in https://gerrit.wikimedia.org/r/plugins/gitiles/operations/software/netbox-extras/+/refs/heads/master/reports/

Netbox reports are a way of validating data within Netbox.

In summary, reports produce a series of log lines that indicate some status connected to a machine, and may be either error, warning, or success. Log lines with no particular disposition for information purposes may also be emitted. Log lines can be tied to any Netbox object for easier referring.

It is better to prevent the invalid data entry at the first place when possible (eg. regex, custom validation)

Report Conventions

Scripts and reports called by systemd timers use the local user sre_bot.

Because of limitations to the UI for Netbox reports, certain conventions have emerged:

Reports should emit one log_error line for each failed item. If the item doesn't exist as a Netbox object, None may be passed in place of the first argument.
If any log_warning lines are produced, they should be grouped after the loop which produces log_error lines.
Reports should emit one log_success which contains a summary of successes, as the last log in the report.
Log messages referring to a single object should be formatted like <verb/condition> <noun/subobject>[: <explanatory extra information>]. Examples:
1. malformed asset tag: WNF1212
2. missing purchase date
Summary log messages should be formatted like <count> <verb/condition> <noun/subobject>
If possible followed with a suggestion on how to fix it (for example what are the proper values)

Report Alert

Most reports that alert are data integrity mismatches due to changes in infrastructure, as a secondary check, and the responsibility of DC-ops.

Some (eg. network report) can have unforeseen consequences on the infrastructure (eg. miss-configurations).

Reports and their Errors
Report	Typical Responsibility	Alerts	Typical Error(s)	Note
Accounting	I/F or DC-ops	✅
Cables	DC-ops	✅
Coherence	DC-Ops	✅
LibreNMS	DC-ops or Netops	✅		You can ignore a LibreNMS device by setting its "ignore alert" flag in LibreNMS
Management	DC-ops	✅
PuppetDB	Whoever changed / reimaged host	✅	<device> missing from PuppetDB or <device> missing from Netbox. These occur because the data in PuppetDB does not match the data in Netbox, typically related to missing devices or unexpected devices. Generally these errors fix themselves once the reimage is complete, but the Netbox record for the host may need to be updated for decommissioning and similar operations.
Network	DC-ops or Netops	✅

Custom Scripts

WebUI: https://netbox.wikimedia.org/extras/scripts/

Doc: https://docs.netbox.dev/en/stable/customization/custom-scripts/

Defined in https://gerrit.wikimedia.org/r/plugins/gitiles/operations/software/netbox-extras/+/refs/heads/master/customscripts/

While Netbox reports are read-only and have a fixed output format, CustomScripts can both write to the database and provide custom output.

In our infrastructure they're used for those two aspects:

Abstract and automate data entry,
- Import Server Facts - imports host network information from PuppetDB into Netbox
- Move Server - moves server location from one place to another making necessary adjustments
- Provision Server Network - adds server to switch connection
- Offline_device - set a device to offline, removing it from rack and deleting network connections
- Replace_device - used to move all attributes from one device to another when being replaced
- add_secondary_ips - for hosts like Cassandra that require more than 1 IP on its primary interface
Format and expose data in a way that can be consumed by external tools,
- Capirca

The above scripts should probably be moved to the plugin feature.

When running a script that writes to the database, run it a first time with "Commit changes" unchecked.

Review the changes that would happen.

Then a second time with "Commit changes" checked to make the changes permanent.

Extra Errors, Notes and Procedures

Would like to remove interface

This error is produced in the Interface Automation script when cleaning up old interfaces during an import.

Interfaces are considered for removal if they don't appear in the list provided by the data source (generally speaking, PuppetDB); they are then checked if there is an IP address or a cable associated with the interface. If there is one of these the interface is left in place so as to not lose data. It is considered a bug if this happens, so if you see this error in an output feel free to open a ticket against #netbox in Phabricator.

Error removing interface after speed change

This error is produced in the Interface Automation script when cleaning up old interfaces when provisioning a server's network attributes.

Specifically for modular interfaces on Juniper devices, the interface name is determined by the speed of the interface, and the port number. If an old interface exists, say xe-1/0/8, on a modular port and we replace the 10G SFP+ with a 25G SFP28, the name of the interface will change to et-1/0/8. JunOS cannot have both defined so the import script will remove the old (xe-1/0/8) interface in Netbox before adding the new one.

This error will get thrown if the old interface still has a cable connected, or an IP address assigned. This shouldn't normally happen, but if it does the old interface should be manually removed, and cables/IPs cleaned up as necessary. Feel free to ping netops members on IRC if there is any confusion, or open a Phabricator task.

Jobs dispatching

When ran, a script (or report) is dispatched as a Redis job. The first frontend to pick it up (through the rq-netbox service) used to get to execute it.

However because of a limitation documented in T341843 rq-netbox now only runs on the primary frontend, which needs to be the frontend local to the Redis server.

Still, make sure that all frontends are in sync (the update-extras cookbook does the right thing).

Custom Fields

WebUI (defined there): https://netbox.wikimedia.org/extras/custom-fields/

Doc: https://docs.netbox.dev/en/stable/customization/custom-fields/

Please open a task for the I/F team if you need a new Custom Field.

Data sources

WebUI: https://netbox.wikimedia.org/core/data-sources/

See also the "netbox-extra" section of this document. Only the source named "Netbox extra" is synced with the netbox-extras cookbook.

nbshell

Not a user facing feature, but an admin feature, useful for troubleshooting.

Doc: https://docs.netbox.dev/en/stable/administration/netbox-shell/

This have the power to break things very quickly if not used carefully.

The bellow command will drop you in a python shell with access to all the Netbox models, similarly to what the CustomScripts use.

sudo -i
cd /srv/deployment/netbox && . venv/bin/activate && cd deploy/src/netbox && python manage.py nbshell

When performing changes it's ideal to make them show up in the Netbox changelog. That can be achieved with something like this:

import uuid
request_id = uuid.uuid4()
user = User.objects.get(username='my_username')

# When modifying an object save also the changes:
device = Device.objects.get(name='hostname')
device.comments = 'some comment'
# See the available choices in:
# https://github.com/netbox-community/netbox/blob/master/netbox/extras/choices.py#L81
log = device.to_objectchange('update')  # create/update/delete
log.request_id = request_id
log.user = user
log.save()
device.save()

For a create, add the log entry after the object creation. For a delete add it before the object deletion.

Tags/ConfigContext

Tags are a slippery slope as they are global and don't have built in mechanism to prevent typos. ConfigContext are much more difficult to audit than fields. We've so far managed to not need them. Therefore,

They MUST NOT be used in our environment.

If you think you need one, please open a task for the I/F team to discuss it.

Housekeeping

A systemd timer runs once a day to perform background cleanup of expired data. More details on https://docs.netbox.dev/en/stable/administration/housekeeping/

Validators

Tracked in https://phabricator.wikimedia.org/T310590

Unlike reports which only trigger after the erroneous change was made, validators ensure that the data entered (using the UI or the API) respect our custom ("business") rules.

Reports should only be used when validators are not suitable (eg. using external tool, sequence of changes).

Our convention is to have a single validator file per Netbox model, each of them have a single class class Main(CustomValidator). Per Netbox requirements, this class MUST have a function named def validate(self, instance).

To "activate" the validator, add the model to the relevant profile::netbox::validators key (prod or dev).

A few things to keep in mind when working on validators:

Bugs in the validators code will return an error 500 when the user will try to interact with Netbox, which could make them think that there is something wrong with Netbox itself.
When editing Netbox through the API, a validator failure will return an error 400
Validators are run for all modifications, please make sure your code is as lean as possible, with the least amount of dependencies

Testing validators

To test a validator on all the existing objects in Netbox you can follow those steps on a Netbox frontend host:

$ sudo -i -u netbox
$ cd /srv/deployment/netbox && . venv/bin/activate && cd deploy/src/netbox
$ sudo vi test_validator.py
### Paste the validator to test in the file
$ python manage.py nbshell
# [...SNIP...] also removing the >>> prefixes for easy copy-pasting
from test_validator import Main
v = Main()
# Adjust the model based on the object you want to test, for example for IPAddresses:
obj_type = IPAddress
for obj in obj_type.objects.all():
    try:
        v.validate(obj, None)
    except Exception as e:
        print(f"{obj} - {e}")

[Ctrl+d]
$ rm test_validator.py
$ tree  __pycache__/
__pycache__/
└── test_validator.cpython-39.pyc
$ rm -rf __pycache__

DON'T FORGET TO REMOVE THE TEST FILE AND THE CACHED ONE

Journaling

Netbox doc: https://netboxlabs.com/docs/netbox/en/stable/features/journaling/

How to use in Netbox scripts:

from extras.models import JournalEntry

JournalEntry.objects.create(assigned_object=my_object,comments='a comment', kind='info')

Where my_object can be any kind of Netbox object (interface, device, circuit, etc), and kind is defined in JournalEntryKindChoices (as of today: 'info' (default if not specified), 'danger', 'success', 'warning'. It's also possible to pass "created_by" a Netbox user, but this is set by default in Netbox scripts. How to use in pynetbox:

api.extras.journal_entries.create(assigned_object_type='<object_type>',assigned_object_id=<object_id>, comments='a comment', kind='info', created_by=<user_id>)

Comments and kind are similar to above
The object can't be passed directly to the create function, it needs to be split in type/id. For example assigned_object_type='dcim.device',assigned_object_id=d.id
- afaik there is no way to programmatically retrieve the assigned the object_type from a given object.
By default, the journal entry will be written as the API key's owner, so usually "sre_bot". It's possible to specify a different user with for example: created_by=api.users.users.get(username='ayounsi').id

Exports

Set of resources that exports Netbox data in various formats.

DNS

A git repository of DNS zonefile snippets generated from Netbox data and exported via HTTPS in read-only mode to be consumed by the DNS#Authoritative_nameservers and the Continuous Integration tests run for the operations/dns Gerrit repository. The repository is available via:

 $ git clone https://netbox-exports.wikimedia.org/dns.git

To update the repository, see DNS/Netbox#Update_generated_records.

The repository is also mirrored in Phabricator: https://phabricator.wikimedia.org/source/netbox-exported-dns/ though it may not be immediately up-to-date.

Puppet

some of the information in netbox is useful in puppet for example

host rack location
hosts managment ip address
hosts status
network devices
network prefixes

In order to make this information available to puppet we have create the sre.puppet.sync-netbox-hiera cookbook which preforms the following actions.

uses graphQL to retrieve the necessary data
publishes the information to git via https://netbox-exports.wikimedia.org/netbox-hiera/
syncs the data to all netbox and cumin hosts
syncs the data to the puppetmaster hosts

At this point the data is available to puppet via the hiera entry for hosts and the common section and can be looked up with the normal hiera lookup methods e.g. to get the host location run lookup('profile::netbox::host::location')

In order to make it easier for users to consume this data we pre load it via dedicated profiles. As such the better way to load data is to include the specific class and then access the data.

include profile::netbox::host
if profile::netbox::host::location['rack'] == 'D3' {
  fail("${facts['networking']['hostname']} should not be in rack D3")
}

We also store bulk information in hiera related to the network and devices not managed by puppet e.g. network devices or management interfaces. This data is mostly useful for monitoring however in future it may replace the current uses of network::constants. you can load this data as follow however please not it is a lot of data and should only be included if needed

include profile::netbox::data
$profile::netbox::data::mgmt.each |$host, $data| {
  if $data['rack'] == 'D3' {
    fail("${host} should not be in rack D3")
  }
}
$profile::netbox::data::network_devices.each |$host, $data| {
  notice("${host} is a ${data['role']}")
}
$profile::netbox::data::prefixes.each |$prefix, $data| {
  notice("${prefix} (${data['description']}) is in vlan ${data['vlan']}")
}

Prometheus

The netbox-more-metrics plugin adds custom metrics (eg. about devices statistics) to the main /metrics Prometheus endpoint.

Which is used to generate https://grafana.wikimedia.org/d/ppq_8SRMk/netbox-device-statistic-breakdown?orgId=1

Imports

Ganeti sync

Refactor and improvements (eg. cluster_group support) in T262446

For each entry under the profile::netbox::ganeti_sync_profiles Hiera key, Puppet created a systemd timer on the active sever to run ganeti-netbox-sync.py with the matching parameters.

Ganeti and Netbox use conflicting and confusing naming, see Ganeti#Netbox naming disambiguation

External scripts

Scripts and tools not previously listed that interact with Netbox, and thus need to be checked for compatibility after significant Netbox changes (eg. upgrades).

Homer

Spicerack

Cookbooks with direct pynetbox calls:

sre.pdus.uptime
sre.pdus.rotate-snmp
sre.network.configure-switch-interfaces
sre.network.peering
sre.hosts.dhcp
sre.pdus.rotate-password
sre.hosts.provision
sre.hosts.reimage
sre.pdus.reboot-and-wait

Cookbooks using GraphQL

sre.puppet.sync-hiera

Pynetbox

Our in house package of pynetbox is now build using the CI pipeline: Debian packaging with dgit and CI

And hosted in https://gitlab.wikimedia.org/repos/sre/pynetbox

Current fleet wide version status: https://debmonitor.wikimedia.org/packages/python3-pynetbox

We also use pynetbox running in various virtualenv like Homer or Netbox itself.

Upgrading Netbox

There are 3 types of upgrades:

Simple upgrade, only Netbox is upgraded
- extremely simple procedure as within patch-level releases they maintain a reasonable level of compatibility in the APIs that we use. When it comes to upgrading across minor versions, breaking changes may have occurred and careful reading of changelogs is needed, as well as testing of the scripts which consume and manipulate data in Netbox
Infrastructure upgrade, where new servers running a newer Netbox are built in parallel of production then switched over
- Much more complex, see the Netbox 4 upgrade task for context
Server refresh, Netbox stays at the same version but the underlying servers are refreshed (eg. new Debian version)
- Never done, probably through failing over the backup server and failing back on the new one, or as infrastructure upgrade

Simple upgrade

Review changelog and note any changes that may interact with our integrations or deployment
Update Netbox repository
Update netbox-deploy repository
Deploy to netbox-dev200x
Tests
Review UI and note any differences to call out during announcement
(if breaking changes) Port scripts
(if breaking changes) Test scripts
Deploy to production

Preparation

Check upstream changelog for any possible breaking changes (usually for major version change only)
- Update Puppet configuration.py and/or scripts/reports/3rd party scripts accordingly
Check the upgrade.sh history and update netbox-deploy:Makefile.deploy accordingly in the dev branch.
Check that the plugins (as of today only "netbox-more-metrics") we use are compatible with the target Netbox version
Depending on the upgrade being performed, it could be preferable to upgrade all the dependencies (see "Build deploy repository" below) ahead of time to reduce the number of variables during the upgrade itself.

Update WMF Netbox repository

git clone https://github.com/netbox-community/netbox.git
git remote add gerrit ssh://<YOUR_GERRIT_USERNAME>@gerrit.wikimedia.org:29418/operations/software/netbox
git checkout master
git push gerrit master
git push --tags gerrit master

Build deploy repository

Netbox has a deployment repository with the collected artifacts (the virtual environment and associated libraries) is used to deploy it. This is updated separately from our branch of Netbox with the following procedure which uses the operations/software/netbox-deploy repository.

In a working copy of operations/software/netbox-deploy, update the src/ subdirectory, which is a submodule of this repository pointing at WMF copy of Netbox Github repository; to do this git pull in that directory, and then check out the tag of the version that is being updated to, for example git checkout v4.0.7.
Build the artifacts by doing a make clean and then make all. This uses docker to collect all of the required libraries as specified in the various requirements.txt files. It creates the artifacts as artifacts/artifacts.bookworm.tar.gz and frozen-requirements-bookworm.txt.
Commit the changes to the repository and submit for review, be sure the following files have changes: frozen-requirements-bookworm.txt, artifacts/artifacts.bookworm.tar.gz, src.

Once the repository is reviewed and merged via Gerrit, it is ready to deploy!

Deploy to Testing Server

The next phase, even for simple upgrades is to deploy to netbox-dev2003.codfw.wmnet for basic testing prior to deploying to production.

Login to a deploy server such as deploy1002.eqiad.wmnet
Go to /srv/deployment/netbox/deploy; this is a check out of the -deploy repository from above.
Pull to the latest version, make sure you're on the good branch (currently main) with git pull origin main
Update the submodule in src : git submodule sync; git submodule update; and verify it's at the good commit with cd src/; git log -1 check that the /deploy directory doesn't have any outstanding changes in git status

Deploy netbox-dev2003, with bug reference in hand running the cookbook:

sudo cookbook sre.deploy.python-code -t T12345 -r 'Release v4.0.6 to netbox-next' -u netbox netbox 'A:netbox-canary'

This process should go smoothly and leave the target machine ready to test.
Deploy a new production database dump to netbox-dev2003's database to ensure parity with production.

Testing

Even for minor upgrades it's recommended to test as much features and code paths as possible, especially related to APIs and external tools. Thinks to look for are obviously errors, but also longer run time, or different data when possible.

You can cherry pick the tests you want to run depending on the changelog and the trust you have in Netbox.

Note that all production features are not used on -next, and some behaviors might not be visible there as well, for example related to the active/passive setup.

On https://netbox-next.wikimedia.org

Test login, make sure your username is correct (and not an uuid)
Test scrips and reports (Customization->Scripts) and compare to production (run without error, doesn't take significantly longer, etc)
Look at some samples of Devices, IPs, VMs, etc and compare to production
Make sure any plugin is showing in the UI
Test a manual sync of the Netbox extras repo
Review background queues for any stuck or failed one
Test validators, eg. by trying to create a device with an invalid asset tag, or an enabled switch interface with a wrong MTU

On netbox-dev2003.codfw.wmnet

Run the netbox_housekeeping service and check for proper execution
Test the Ganeti VM sync script, for example by:
- Temporarily adding the profile:codfw_test stanza to /etc/netbox/ganeti-sync.cfg
```
[profile:codfw_test]
site=codfw
cluster=ganeti-test01.svc.codfw.wmnet
```
- Run sudo -u netbox /srv/deployment/netbox/venv/bin/python3 /srv/deployment/netbox-extras/tools/ganeti-netbox-sync.py codfw_test
- Optionally delete a VM from Netbox-next and verify that it's properly re-added
Test the DNs generation script, sudo -u netbox python3 /srv/deployment/netbox-extras/dns/generate_dns_snippets.py -v commit "test"
Check the logs for anything unexpected (in syslog and /srv/log/netbox/main.log
Check that the Django/Netbox metrics are properly exposed to Prometheus

On your computer

If possible and if you have the various tools checked out locally. Then to be re-run from cumin hosts once the migration is over.

Run Homer with a diff or generate action with your config file pointing to netbox-next. This should produce no diff if the database is updated to production contents
As much as realistically possible, run Spicerack and the cookbooks listed on Netbox#External scripts in dry-run while pointing the config to netbox-next.

Production testing

-next have some differences with the prod instance, especially that it runs as a standalone instance (all frontend, DB and Redis are on the same host). In production it's needed to check that the Jobs runs are properly dispatched by Redis (in the "Background Queues" page).
Force run the various Ganeti sync and reports systemd timers
Run the dns cookbook as dry-run
Run the Netbox Hiera cookbook as dry run
Check Netbox#Prometheus graphs
Check overall monitoring (Alertmanager/Icinga)

Porting and Testing

This covers both expected (in the changelog) and unexpected (eg. if we use changes. If possible, write patches in a backward compatibility way and deploy them ahead of the production infra upgrade.

Once porting is thought to be complete, the changes should be deployed to netbox-dev2003, Spicerack, Homer, etc, and a full run of testing should be done to verify that the changes made fix the problems that turned up in initial testing, including attempting to go down avenues of execution that may not normally be hit.

If it's not possible/convenient to write the patches in a backward compatible way, the upgrade timeline is critical, and the new patches will need to be deployed in the same maintenance window as the production upgrade, and possibly re-worked if any issue shows up in production.

Deploy to Production

After a final run through of any problem areas exposed in above testing and fixes are deployed to netbox-dev2003 it is finally time to deploy the new version to production, with the following procedure:

Announce that the release will be occuring on #wikimedia-dcops, #mediawiki_security, #wikimedia-operations, #wikimedia-sre and, if necessary, coordinate a time when integration tools or DCops work will not be interrupted.
Merge any outstanding changes to netbox-extras or netbox/deploy repositories (if necessary).
On netboxdb[1-2]2003, perform a manual dump of the database.
Deploy netbox-extras to production using cumin, as in #Netbox_Extras
Announce on IRC that a deploy is happening, on #wikimedia-operations: !log Deploying Netbox v4... to production Tbug
Deploy to production:
1. Login to a Deployment server
2. Go to /srv/deployment/netbox/deploy; this is a check out of the -deploy repository from above.
3. Pull to the latest version, and update the submodule in src by pulling and checking out the tag that is going to be deployed.
4. Deploy with the cookbook with bug reference in hand: sudo cookbook sre.deploy.python-code -r 'Release v4.xx to production' -u netbox netbox 'A:netbox'
5. Make sure no issues happened during the deploy (eg. DB migration schema)
Perform testing as above, and in general make sure everything is as expected
Announce (and !log) the end of the upgrade

Troubleshooting

It's possible to run scripts and reports through the command line, for example:

python3 manage.py runscript interface_automation.ImportPuppetDB --data '{"device": "ml-cache1003"}' --commit

CablePath matching query does not exist

If getting the following error when saving a cable through a script, after deleting a cable:

dcim.models.cables.CablePath.DoesNotExist: CablePath matching query does not exist

Make sure you're running i.refresh_from_db() where i is the interface(s) where the previous cable was attached to.

Datasource sync: `AttributeError: 'NoneType' object has no attribute 'sync'`

Netbox 4. This only happened once, root cause unknown. Full stacktrace in syslog. We should open an upstream issue if it happens again.

Open nbshell

Run to find the problematic object. It should contain None in the line.

asr = AutoSyncRecord.objects.all()
for idx, a in enumerate(asr):
    print(f'{idx} - {a} - {a.object}')

Delete it with asr[XXX].delete() where XXX is its list index.

Known issues Future improvements

Phabricator project - https://phabricator.wikimedia.org/tag/netbox/

Improve our infrastructure modeling

Make more extensive use of Netbox custom fields - task T305126

Improve automation and reduce tech debt

Netbox: investigate GraphQL API - task T310577
Netbox: use Journaling feature - task T310583
Netbox: basic change rollback - task T310589
Netbox in codfw slowness issue (path to active/active) - task T330883

History

At Wikimedia it was evaluated in T170144 as a replacement for Racktables.
In T199083 the actual migration between the systems took place
T266487 - Netbox 2.9 upgrade
T288515 - Netbox vs. Nautobot
T296452 - documents the large upgrade from 2.10 to 3.2 and the subsequent improvements it brought
T314933 Upgrade Netbox to latest 3.2
T336275 - Upgrade Netbox to 4.x
T371889 - Upgrade Netbox to 4.1

External link

https://netbox.wikimedia.org (restricted)

Web UI

API

Staging

Production infrastructure

Monitoring

Icinga

Frontends

Databases

Prometheus

Logstash

Failover

Frontends

Databases

Database

Restore

PostgresSQL

Dumps backups

Stop the Netbox services before restore

Restore the DB dump

Start Netbox services after a restore

Sanitizing a database dump

Netbox Extras

More details

Netbox (and source of truth) principles

Netbox features

Custom Links

Reports

Report Conventions

Report Alert

Custom Scripts

Extra Errors, Notes and Procedures

Would like to remove interface

Error removing interface after speed change

Jobs dispatching

Custom Fields

Data sources

nbshell

Tags/ConfigContext

Housekeeping

Validators

Testing validators

Journaling

Exports

DNS

Puppet

Prometheus

Imports

Ganeti sync

External scripts

Pynetbox

Upgrading Netbox

Simple upgrade

Preparation

Update WMF Netbox repository

Build deploy repository

Deploy to Testing Server

Testing

On https://netbox-next.wikimedia.org

On netbox-dev2003.codfw.wmnet

On your computer

Production testing

Porting and Testing

Deploy to Production

Troubleshooting

CablePath matching query does not exist

Datasource sync: AttributeError: 'NoneType' object has no attribute 'sync'

Known issues Future improvements

Improve our infrastructure modeling

Improve automation and reduce tech debt

History

See also

External link

Datasource sync: `AttributeError: 'NoneType' object has no attribute 'sync'`