Netbox

From Wikitech

Netbox is used by Wikimedia as a tool for data center infrastructure management (DCIM) and IP address management (IPAM). It also serves as an integration point for switch and port management, DNS management, and other network operations.

Web UI

API

Staging

  • It consists of a single bookworm VM (netbox-dev2003) combining the frontend, Redis and the database
  • Reachable at netbox-next.wikimedia.org and netbox-next.discovery.wmnet
    • Behind caches, similar to the production infrastructure
  • Its data comes from a manual dump of the production database
    • Reach out to Infrastructure Foundations if you need a fresher database
    • Be careful not to leak any of its data
  • It is used to test Netbox upgrades, scripts, reports, etc.
  • This host is active in monitoring (with notifications disabled)
    • As such, make sure that all alerts have cleared after your tests

Production infrastructure

The production Netbox infrastructure consists of 4 bullseye VMs (see all Netbox VMs):

  • 2 active/passive frontends (netboxXXXX)
  • 2 primary/replica PostgreSQL databases (netboxdbXXXX)

By default the active/primary servers are the eqiad ones.

The public endpoint is behind our CDN so the request flow is:

  1. CDN (using the wildcard *.wikimedia.org as its TLS certificate)
  2. active frontends
    1. Apache (using cfssl for its TLS certificate)
    2. Django app (through uwsgi)
  3. Active database

Monitoring

Icinga

See all Netbox related Icinga checks: https://icinga.wikimedia.org/cgi-bin/icinga/status.cgi?search_string=netbox

In addition to the regular set of VM checks run on all servers, some Icinga checks only run on the active servers.

These are controlled by the profile::netbox::db::primary and profile::netbox::active_server Hiera keys.

Frontends

Controlled by the profile::netbox::active_server Hiera key:

  • Alerting for the Ganeti sync systemd timers (are they running correctly?) - see also Netbox#Ganeti sync
  • Alerting for the Netbox reports (is there invalid data in Netbox?) - see also Netbox#Reports
  • Alerting for the DNS export automation (are there uncommitted DNS changes in Netbox?) - see also Netbox#DNS

Databases

The replica has a check for replication delay.

Prometheus

Setup task: https://phabricator.wikimedia.org/T243928

Global health overview (beta): https://grafana.wikimedia.org/d/DvXT6LCnk/

Logstash

https://logstash.wikimedia.org/app/dashboards#/view/AXB84iDRKWrIH1QRIR_j

Failover

Frontends

Using confctl, pool the passive server and depool the previous active one.

confctl --object-type discovery select 'dnsdisc=netbox,name=codfw' set/pooled=true
confctl --object-type discovery select 'dnsdisc=netbox,name=eqiad' set/pooled=false

If the failover is going to last (eg. longer than a server reboot), change the profile::netbox::active_server Hiera key to the backup server. This ensures the systemd timers as well as the Icinga checks run on the correct host.

Note that having the active frontend in a different datacenter than the primary database will result in Netbox being slower.

Databases

If the primary database server needs a short downtime, it's recommended not to attempt a failover and instead accept having Netbox offline for a short amount of time.

There is currently no documented procedure on how to fail the database over, let alone how to fail back to the former primary.

See also Postgres

Database

Restore

First of all, analyze the Netbox changelog to choose the best way to perform the restore.

The general options are:

  • Manually (or via the API) replay the actions listed in the changelog in reverse order. The changelog entries don't contain the full raw data, and some of them may show names instead of the IDs required by the API.
  • Restore a database dump. This ensures consistency at a given point in time, and can even be used to perform a partial restore using pg_restore.

To restore files from Bacula back to the client, use bconsole on helium and refer to Bacula#Restore_(aka_Panic_mode) for detailed steps.

PostgreSQL

Dump backups

On the database servers, a Puppetized systemd timer (class postgresql::backup) automatically dumps all local Postgres databases (pg_dumpall) and stores the files in /srv/postgres-backup:

  • On the primary node: daily dumps
  • On the secondary node: hourly dumps

This path is then backed up by Bacula (see Bacula#Adding a new client).

For more details, see the task to set up backups (Phab:T190184), later improved in task T262677.

Stop the Netbox services before restore

Postgres may prevent us from deleting the current database before the restore if there are active remote connections to it. This is usually not the case, but it has been observed on occasion. To minimize the chance of it happening, we should stop the Netbox services on the active Netbox host (as of June 2022 netbox1002) prior to restoring the DB.

First downtime the active netbox server with the downtime cookbook from a cumin host:

sudo cookbook sre.hosts.downtime --minutes 30 -r "Restoring DB from backup on netboxdb1002" -t <task> netbox1002.eqiad.wmnet

Then on the active netbox host itself:

sudo systemctl stop rq-netbox
sudo systemctl stop uwsgi-netbox.service

Restore the DB dump

NOTE: The instructions below are also valid for copying a live DB dump to netbox-next; on that host the Postgres DB runs locally rather than on a dedicated DB VM. Also see the note below about running Puppet afterwards to change the netbox DB user password (which is different on the dev host, so it must be changed after restoring a dump from the live one).

  • Check the dump files on the secondary DB host (as of Dec. 2022 netboxdb2003) in /srv/postgres-backup; if there is any issue with those files, do the same on the primary host. The secondary host performs hourly backups while the primary only performs daily ones.
  • If the secondary host has a newer backup:
    • Copy the dump to the primary DB host (as of Dec. 2022 netboxdb1003). From one of the cumin hosts (cumin1002.eqiad.wmnet, cumin2002.codfw.wmnet), as root, run:
      SSH_AUTH_SOCK=/run/keyholder/proxy.sock scp -3 root@netboxdb2003.codfw.wmnet:/srv/postgres-backup/psql-all-dbs-latest.sql.gz root@netboxdb1003.eqiad.wmnet:/srv/
      
    • SSH into the primary DB host (as of Nov. 2024 netboxdb1003)
    • Change the ownership of the copied backup to postgres:postgres
  • Take a one-off backup on the primary DB host (as of Dec. 2022 netboxdb1003) right before starting the restore (the .bak suffix is important so it is not auto-evicted):
    $ su - postgres
    $ /usr/bin/pg_dumpall | /bin/gzip > /srv/postgres-backup/${USER}-DESCRIPTION.psql-all-dbs-$(date +\%Y\%m\%d).sql.gz.bak
    
  • Become postgres user:
    sudo -i -u postgres
    
  • Connect to the DB, list and drop the Netbox database:
$ psql
postgres=# \l
...
postgres=# DROP DATABASE netbox;
DROP DATABASE
postgres=#
# NOTE: you may still get a message saying 'database "netbox" is being accessed by other users', which prevents you from dropping the DB. These connections can come from the active Netbox host (running reports triggered by systemd timers) and from the backup DB host. It is usually easiest to wait until these complete and retry; if it cannot wait, stop the services/processes connecting from those remote hosts. As a last resort, 'DROP DATABASE "netbox" WITH (FORCE);' can be used.
  • Still as the postgres user, restore the DB with:
$ gunzip < /srv/psql-all-dbs-SOME_DATE.sql.gz | /usr/bin/psql
  • DEV Host password

If the dump has been restored to the DEV host hosting netbox-next, run Puppet at this point to fix the netbox DB user password. NOTE: it has been noticed recently (Nov 2024) that Puppet does not adjust the password in some cases. If there are logs such as 'password authentication failed for user "netbox"', you can manually change the netbox user password (the password is available in /etc/netbox/configuration.py):

sudo -i -u postgres
psql netbox
ALTER USER netbox WITH PASSWORD '<password>';


Start Netbox services after a restore

After the DB has been restored we can restart Netbox. If the restore was on the netbox-next host, first run Puppet to fix the DB password. SSH into the active Netbox host (as of June 2022 netbox1003) and execute:

sudo systemctl restart uwsgi-netbox.service
sudo systemctl restart rq-netbox.service
sudo systemctl status uwsgi-netbox.service
sudo systemctl status rq-netbox.service

Then check the logs in /srv/log/netbox/main.log and that netbox.wikimedia.org works properly. Check also the last item in the Netbox changelog section in the UI to ensure the data is correctly loaded.

Sanitizing a database dump

The Netbox database contains a few bits of sensitive information, and if it is going to be used for testing purposes in WMCS it should be sanitized first.

  1. Create a copy of the main database: createdb netbox-sanitize && pg_dump netbox | psql netbox-sanitize
  2. Run the SQL code below on the netbox-sanitize database.
  3. Dump and drop the database: pg_dump netbox-sanitize > netbox-sanitized.sql; dropdb netbox-sanitize

THE BELOW COMMANDS ARE OUTDATED AND MIGHT NOT COVER EVERYTHING THAT NEEDS TO BE SANITIZED

-- truncate secrets
TRUNCATE secrets_secret CASCADE;
TRUNCATE secrets_sessionkey CASCADE;
TRUNCATE secrets_userkey CASCADE;

-- sanitize dcim_serial
UPDATE dcim_device SET serial = concat('SERIAL', id::TEXT);

-- truncate user table
TRUNCATE auth_user CASCADE;

-- sanitize dcim_interface.mac_address
UPDATE dcim_interface SET mac_address = CONCAT(
                   LPAD(TO_HEX(FLOOR(random() * 255 + 1) :: INT)::TEXT, 2, '0'), ':',
                   LPAD(TO_HEX(FLOOR(random() * 255 + 1) :: INT)::TEXT, 2, '0'), ':',
                   LPAD(TO_HEX(FLOOR(random() * 255 + 1) :: INT)::TEXT, 2, '0'), ':',
                   LPAD(TO_HEX(FLOOR(random() * 255 + 1) :: INT)::TEXT, 2, '0'), ':',
                   LPAD(TO_HEX(FLOOR(random() * 255 + 1) :: INT)::TEXT, 2, '0'), ':',
                   LPAD(TO_HEX(FLOOR(random() * 255 + 1) :: INT)::TEXT, 2, '0')) :: macaddr;

-- sanitize circuits_circuit.cid
UPDATE circuits_circuit SET cid = concat('CIRCUIT', id::TEXT);
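For local testing, the MAC-randomization step above can be sketched in Python (the helper name is ours, purely illustrative, and mirrors the SQL's 1-255 octet range):

```python
import random
import re

def random_mac():
    """Generate six random octets (1-255, matching the SQL above, which
    never produces a 00 octet) formatted as a colon-separated MAC."""
    return ":".join(f"{random.randint(1, 255):02x}" for _ in range(6))

# Every generated value parses as a lowercase colon-separated MAC address
assert re.fullmatch(r"([0-9a-f]{2}:){5}[0-9a-f]{2}", random_mac())
```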

Netbox Extras

CustomScripts, reports (merged with scripts in the UI), validators and other associated tools for Netbox are collected in the netbox-extras repository at https://gerrit.wikimedia.org/r/plugins/gitiles/operations/software/netbox-extras/.

As a safeguard, after merging your change, its deployment is not fully automatic.

TL;DR: after merging your change, use the sre.netbox.update-extras cookbook.

  • For Netbox DEV: sudo cookbook sre.netbox.update-extras --reason 'a good reason' -a netbox-canary
  • For Netbox PROD: sudo cookbook sre.netbox.update-extras --reason 'a good reason' -a netbox
If files in the "validators" directory are changed, the cookbook will also restart uwsgi-netbox

More details

Since Netbox 4, the repository needs to be deployed to 2 different locations (both on the frontends):

  • Through the local checkout of the git repository under /srv/deployment/netbox-extras, for validators, tools (Ganeti sync, DNS sync) and the shared common.py scripts code.
  • Through Netbox's DataSource module, for scripts and reports, which ultimately copies them to /srv/netbox/customscripts/ after going through its internal DB.

Syncing the datasource by clicking the button in Netbox's UI only syncs it on the primary frontend. Scripts and reports are dispatched across all frontends when run, so make sure netbox-extras is in sync between all nodes.

Netbox (and source of truth) principles

  • Data automatically synced from the infrastructure should not drive the infrastructure
    • It is to be used for information purposes (eg. VM disk space) or as support for original data (eg. server interfaces for cables/IP/dns_name)
  • All data manually entered will have entry mistakes
    • Use helper scripts, input validation or post entry consistency checks (reports)
  • All data manually entered will go stale
    • Refrain from adding data that will not drive the infrastructure

Netbox features

WebUI (defined there): https://netbox.wikimedia.org/extras/custom-links/

Doc: https://docs.netbox.dev/en/stable/models/extras/customlink/

Netbox allows setting up custom links to other websites using Jinja2 templating for both the displayed name and the actual link, allowing for quite some flexibility. The current setup (as of August 2024) has the following links:

  • Grafana (for all physical devices and VMs)
  • Icinga (for all physical devices and VMs)
  • AlertManager (for all physical devices and VMs)
  • Debmonitor (for all physical devices and VMs)
  • Procurement Ticket (only for physical devices that have a ticket that matches either Phabricator or RT)
  • Hardware config (for Dell and HP physical devices, pointing to the manufacturer page for warranty information based on their serial number)
  • LibreNMS (for Juniper, opengear and sentry devices)
  • Puppetboard (for all physical devices and VMs)

Reports

Reports are deprecated beginning with NetBox v4.0, and their functionality has been merged with custom scripts. While backward compatibility has been maintained, users are advised to convert legacy reports into custom scripts soon, as support for legacy reports will be removed in a future release.

WebUI (reports results): https://netbox.wikimedia.org/extras/scripts/

Doc: https://netboxlabs.com/docs/netbox/en/stable/customization/reports/

Defined in https://gerrit.wikimedia.org/r/plugins/gitiles/operations/software/netbox-extras/+/refs/heads/master/reports/

Netbox reports are a way of validating data within Netbox.

In summary, reports produce a series of log lines that indicate some status connected to a machine, each of which may be an error, a warning, or a success. Log lines with no particular disposition may also be emitted for information purposes. Log lines can be tied to any Netbox object for easier reference.

It is better to prevent invalid data entry in the first place when possible (eg. regex, custom validation).

Report Conventions

Scripts and reports called by systemd timers use the local user sre_bot.

Because of limitations to the UI for Netbox reports, certain conventions have emerged:

  1. Reports should emit one log_error line for each failed item. If the item doesn't exist as a Netbox object, None may be passed in place of the first argument.
  2. If any log_warning lines are produced, they should be grouped after the loop which produces log_error lines.
  3. Reports should emit one log_success which contains a summary of successes, as the last log in the report.
  4. Log messages referring to a single object should be formatted like <verb/condition> <noun/subobject>[: <explanatory extra information>]. Examples:
    1. malformed asset tag: WNF1212
    2. missing purchase date
  5. Summary log messages should be formatted like <count> <verb/condition> <noun/subobject>
  6. If possible, follow with a suggestion on how to fix it (for example, what the proper values are)
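The shape these conventions describe can be sketched standalone; here `Report` is a minimal stub standing in for Netbox's report base class, and the asset-tag rule and data are invented for illustration:

```python
class Report:
    """Stub standing in for Netbox's report base class: it just collects
    the (level, object, message) tuples that a real report would log."""
    def __init__(self):
        self.lines = []
    def log_error(self, obj, message):
        self.lines.append(("error", obj, message))
    def log_warning(self, obj, message):
        self.lines.append(("warning", obj, message))
    def log_success(self, obj, message):
        self.lines.append(("success", obj, message))

class AssetTagReport(Report):
    """Hypothetical report checking asset tags on a list of devices."""
    def run(self, devices):
        failed = 0
        for device in devices:
            if not device["asset_tag"].startswith("WMF"):
                # Conventions 1 and 4: one log_error per failed item,
                # formatted as <condition> <subobject>: <extra info>
                self.log_error(device["name"],
                               f"malformed asset tag: {device['asset_tag']}")
                failed += 1
        # Conventions 3 and 5: one summary log_success as the last line,
        # formatted as <count> <condition> <subobject>
        self.log_success(None, f"{len(devices) - failed} valid asset tag")

report = AssetTagReport()
report.run([{"name": "db1001", "asset_tag": "WMF1212"},
            {"name": "db1002", "asset_tag": "WNF1212"}])
assert report.lines[0] == ("error", "db1002", "malformed asset tag: WNF1212")
assert report.lines[-1] == ("success", None, "1 valid asset tag")
```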

Report Alert

Most reports that alert are secondary checks for data-integrity mismatches due to changes in the infrastructure, and are the responsibility of DC-ops.

Some (eg. the network report) can point to issues with unforeseen consequences on the infrastructure (eg. misconfigurations).

Reports, their typical responsibility, and typical errors:

  • Accounting: I/F or DC-ops
  • Cables: DC-ops
  • Coherence: DC-ops
  • LibreNMS: DC-ops or Netops. You can ignore a LibreNMS device by setting its "ignore alert" flag in LibreNMS.
  • Management: DC-ops
  • PuppetDB: whoever changed or reimaged the host. Typical errors: <device> missing from PuppetDB or <device> missing from Netbox. These occur because the data in PuppetDB does not match the data in Netbox, typically due to missing or unexpected devices. Generally these errors fix themselves once the reimage is complete, but the Netbox record for the host may need to be updated for decommissioning and similar operations.
  • Network: DC-ops or Netops

Custom Scripts

WebUI: https://netbox.wikimedia.org/extras/scripts/

Doc: https://docs.netbox.dev/en/stable/customization/custom-scripts/

Defined in https://gerrit.wikimedia.org/r/plugins/gitiles/operations/software/netbox-extras/+/refs/heads/master/customscripts/

While Netbox reports are read-only and have a fixed output format, CustomScripts can both write to the database and provide custom output.

In our infrastructure they're used for these two purposes:

  • Abstract and automate data entry,
    • Import Server Facts - imports host network information from PuppetDB into Netbox
    • Move Server - moves server location from one place to another making necessary adjustments
    • Provision Server Network - adds server to switch connection
    • Offline_device - set a device to offline, removing it from rack and deleting network connections
    • Replace_device - used to move all attributes from one device to another when being replaced
    • add_secondary_ips - for hosts like Cassandra that require more than 1 IP on their primary interface
  • Format and expose data in a way that can be consumed by external tools,
    • Capirca


The above scripts should probably be moved to the plugin feature.

When running a script that writes to the database, run it a first time with "Commit changes" unchecked, review the changes that would happen, then run it a second time with "Commit changes" checked to make the changes permanent.

Extra Errors, Notes and Procedures

Would like to remove interface

This error is produced in the Interface Automation script when cleaning up old interfaces during an import.

Interfaces are considered for removal if they don't appear in the list provided by the data source (generally speaking, PuppetDB); they are then checked for an associated IP address or cable. If either is present, the interface is left in place so as not to lose data. This situation is considered a bug, so if you see this error in an output feel free to open a ticket against #netbox in Phabricator.
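The cleanup rule above can be sketched as follows (a hypothetical helper, not the actual script code):

```python
def may_remove_stale_interface(ip_addresses, cable):
    """An interface absent from the data source (e.g. PuppetDB) is only
    removed if it has no IP addresses and no cable attached; otherwise it
    is left in place to avoid losing data."""
    return not ip_addresses and cable is None

# Safe to remove: no associated data
assert may_remove_stale_interface([], None)
# Kept: an IP address or a cable is still attached
assert not may_remove_stale_interface(["10.0.0.1/24"], None)
assert not may_remove_stale_interface([], "cable-123")
```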

Error removing interface after speed change

This error is produced in the Interface Automation script when cleaning up old interfaces when provisioning a server's network attributes.

Specifically for modular interfaces on Juniper devices, the interface name is determined by the speed of the interface, and the port number. If an old interface exists, say xe-1/0/8, on a modular port and we replace the 10G SFP+ with a 25G SFP28, the name of the interface will change to et-1/0/8. JunOS cannot have both defined so the import script will remove the old (xe-1/0/8) interface in Netbox before adding the new one.

This error will get thrown if the old interface still has a cable connected, or an IP address assigned. This shouldn't normally happen, but if it does the old interface should be manually removed, and cables/IPs cleaned up as necessary. Feel free to ping netops members on IRC if there is any confusion, or open a Phabricator task.
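The Juniper naming rule can be illustrated with a small sketch (the helper is ours; the speed-to-prefix mapping shown covers the common cases):

```python
# Juniper interface prefixes are derived from port speed, so swapping a
# 10G SFP+ for a 25G SFP28 renames the interface (xe-1/0/8 -> et-1/0/8).
SPEED_PREFIX = {1: "ge", 10: "xe", 25: "et", 40: "et", 100: "et"}

def juniper_interface_name(speed_gbps, fpc, pic, port):
    """Build the interface name from speed and FPC/PIC/port numbers."""
    return f"{SPEED_PREFIX[speed_gbps]}-{fpc}/{pic}/{port}"

assert juniper_interface_name(10, 1, 0, 8) == "xe-1/0/8"
assert juniper_interface_name(25, 1, 0, 8) == "et-1/0/8"
```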

Jobs dispatching

When run, a script (or report) is dispatched as a Redis job. Previously, the first frontend to pick it up (through the rq-netbox service) would execute it.

However because of a limitation documented in T341843 rq-netbox now only runs on the primary frontend, which needs to be the frontend local to the Redis server.

Still, make sure that all frontends are in sync (the update-extras cookbook does the right thing).

Custom Fields

WebUI (defined there): https://netbox.wikimedia.org/extras/custom-fields/

Doc: https://docs.netbox.dev/en/stable/customization/custom-fields/

Please open a task for the I/F team if you need a new Custom Field.

Data sources

WebUI: https://netbox.wikimedia.org/core/data-sources/

See also the "Netbox Extras" section of this document. Only the source named "Netbox extra" is synced by the update-extras cookbook.

nbshell

Not a user-facing feature, but an admin feature, useful for troubleshooting.

Doc: https://docs.netbox.dev/en/stable/administration/netbox-shell/

This has the power to break things very quickly if not used carefully.

The below command will drop you into a Python shell with access to all the Netbox models, similar to what the CustomScripts use.

sudo -i
cd /srv/deployment/netbox && . venv/bin/activate && cd deploy/src/netbox && python manage.py nbshell

When performing changes it's ideal to make them show up in the Netbox changelog. That can be achieved with something like this:

import uuid
request_id = uuid.uuid4()
user = User.objects.get(username='my_username')

# When modifying an object save also the changes:
device = Device.objects.get(name='hostname')
device.comments = 'some comment'
# See the available choices in:
# https://github.com/netbox-community/netbox/blob/master/netbox/extras/choices.py#L81
log = device.to_objectchange('update')  # create/update/delete
log.request_id = request_id
log.user = user
log.save()
device.save()

For a create, add the log entry after the object creation. For a delete, add it before the object deletion.

Tags/ConfigContext

Tags are a slippery slope as they are global and have no built-in mechanism to prevent typos. ConfigContexts are much more difficult to audit than fields. We've so far managed not to need them. Therefore,

They MUST NOT be used in our environment.

If you think you need one, please open a task for the I/F team to discuss it.

Housekeeping

A systemd timer runs once a day to perform background cleanup of expired data. More details on https://docs.netbox.dev/en/stable/administration/housekeeping/

Validators

Tracked in https://phabricator.wikimedia.org/T310590

Unlike reports, which only trigger after an erroneous change was made, validators ensure that the data entered (via the UI or the API) respects our custom ("business") rules.

Reports should only be used when validators are not suitable (eg. when using an external tool or checking a sequence of changes).

Our convention is to have a single validator file per Netbox model, each of which has a single class Main(CustomValidator). Per Netbox requirements, this class MUST have a function named def validate(self, instance).

To "activate" the validator, add the model to the relevant profile::netbox::validators key (prod or dev).

A few things to keep in mind when working on validators:

  • Bugs in the validator code will return a 500 error when users try to interact with Netbox, which could make them think something is wrong with Netbox itself.
  • When editing Netbox through the API, a validator failure returns a 400 error
  • Validators run on every modification; please keep your code as lean as possible, with the fewest possible dependencies

Testing validators

To test a validator on all the existing objects in Netbox, follow these steps on a Netbox frontend host:

$ sudo -i -u netbox
$ cd /srv/deployment/netbox && . venv/bin/activate && cd deploy/src/netbox
$ sudo vi test_validator.py
### Paste the validator to test in the file
$ python manage.py nbshell
# [...SNIP...] also removing the >>> prefixes for easy copy-pasting
from test_validator import Main
v = Main()
# Adjust the model based on the object you want to test, for example for IPAddresses:
obj_type = IPAddress
for obj in obj_type.objects.all():
    try:
        v.validate(obj, None)
    except Exception as e:
        print(f"{obj} - {e}")

[Ctrl+d]
$ rm test_validator.py
$ tree  __pycache__/
__pycache__/
└── test_validator.cpython-39.pyc
$ rm -rf __pycache__

DON'T FORGET TO REMOVE THE TEST FILE AND THE CACHED ONE

Journaling

Netbox doc: https://netboxlabs.com/docs/netbox/en/stable/features/journaling/

How to use in Netbox scripts:

from extras.models import JournalEntry

JournalEntry.objects.create(assigned_object=my_object,comments='a comment', kind='info')

Where my_object can be any kind of Netbox object (interface, device, circuit, etc.), and kind is defined in JournalEntryKindChoices (as of today: 'info' (the default if not specified), 'danger', 'success', 'warning'). It's also possible to pass created_by a Netbox user, but this is set by default in Netbox scripts.

How to use in pynetbox:

api.extras.journal_entries.create(assigned_object_type='<object_type>',assigned_object_id=<object_id>, comments='a comment', kind='info', created_by=<user_id>)
  • Comments and kind are similar to above
  • The object can't be passed directly to the create function, it needs to be split in type/id. For example assigned_object_type='dcim.device',assigned_object_id=d.id
    • As far as we know, there is no way to programmatically retrieve the object_type from a given object.
  • By default, the journal entry will be written as the API key's owner, so usually "sre_bot". It's possible to specify a different user with for example: created_by=api.users.users.get(username='ayounsi').id

Exports

Set of resources that exports Netbox data in various formats.

DNS

A git repository of DNS zonefile snippets generated from Netbox data and exported via HTTPS in read-only mode to be consumed by the DNS#Authoritative_nameservers and the Continuous Integration tests run for the operations/dns Gerrit repository. The repository is available via:

 $ git clone https://netbox-exports.wikimedia.org/dns.git

To update the repository, see DNS/Netbox#Update_generated_records.

The repository is also mirrored in Phabricator: https://phabricator.wikimedia.org/source/netbox-exported-dns/ though it may not be immediately up-to-date.

Puppet

Some of the information in Netbox is useful in Puppet, for example:

  • host rack location
  • host management IP address
  • host status
  • network devices
  • network prefixes

In order to make this information available to Puppet, we have created the sre.puppet.sync-netbox-hiera cookbook.

At this point the data is available to Puppet via the Hiera entry for hosts and the common section, and can be looked up with the normal Hiera lookup methods, e.g. to get the host location run lookup('profile::netbox::host::location')

In order to make it easier for users to consume this data, we preload it via dedicated profiles. As such, the preferred way to load the data is to include the specific class and then access it:

include profile::netbox::host
if profile::netbox::host::location['rack'] == 'D3' {
  fail("${facts['networking']['hostname']} should not be in rack D3")
}

We also store bulk information in Hiera related to the network and to devices not managed by Puppet, e.g. network devices or management interfaces. This data is mostly useful for monitoring; however, in the future it may replace the current uses of network::constants. You can load this data as follows, but please note it is a lot of data and should only be included if needed:

include profile::netbox::data
$profile::netbox::data::mgmt.each |$host, $data| {
  if $data['rack'] == 'D3' {
    fail("${host} should not be in rack D3")
  }
}
$profile::netbox::data::network_devices.each |$host, $data| {
  notice("${host} is a ${data['role']}")
}
$profile::netbox::data::prefixes.each |$prefix, $data| {
  notice("${prefix} (${data['description']}) is in vlan ${data['vlan']}")
}

Prometheus

The netbox-more-metrics plugin adds custom metrics (eg. about devices statistics) to the main /metrics Prometheus endpoint.

This is used to generate https://grafana.wikimedia.org/d/ppq_8SRMk/netbox-device-statistic-breakdown?orgId=1

Imports

Ganeti sync

Refactor and improvements (eg. cluster_group support) in T262446

For each entry under the profile::netbox::ganeti_sync_profiles Hiera key, Puppet creates a systemd timer on the active server to run ganeti-netbox-sync.py with the matching parameters.

Ganeti and Netbox use conflicting and confusing naming, see Ganeti#Netbox naming disambiguation

External scripts

Scripts and tools not previously listed that interact with Netbox, and thus need to be checked for compatibility after significant Netbox changes (eg. upgrades).

Homer

Spicerack

Cookbooks with direct pynetbox calls:

  • sre.pdus.uptime
  • sre.pdus.rotate-snmp
  • sre.network.configure-switch-interfaces
  • sre.network.peering
  • sre.hosts.dhcp
  • sre.pdus.rotate-password
  • sre.hosts.provision
  • sre.hosts.reimage
  • sre.pdus.reboot-and-wait

Cookbooks using GraphQL

  • sre.puppet.sync-hiera

Pynetbox

Our in-house package of pynetbox is now built using the CI pipeline: Debian packaging with dgit and CI

It is hosted at https://gitlab.wikimedia.org/repos/sre/pynetbox

Current fleet wide version status: https://debmonitor.wikimedia.org/packages/python3-pynetbox

We also use pynetbox running in various virtualenvs, such as Homer or Netbox itself.

Upgrading Netbox

There are 3 types of upgrades:

  1. Simple upgrade, only Netbox is upgraded
    • Extremely simple procedure: within patch-level releases, Netbox maintains a reasonable level of compatibility in the APIs that we use. When upgrading across minor versions, breaking changes may have occurred, so careful reading of the changelogs is needed, as well as testing of the scripts which consume and manipulate data in Netbox
  2. Infrastructure upgrade, where new servers running a newer Netbox are built in parallel to production and then switched over
    • Much more complex, see the Netbox 4 upgrade task for context
  3. Server refresh, Netbox stays at the same version but the underlying servers are refreshed (eg. new Debian version)
    • Never done; probably by failing over to the backup server and failing back on the new one, or as an infrastructure upgrade

Simple upgrade

  1. Review changelog and note any changes that may interact with our integrations or deployment
  2. Update Netbox repository
  3. Update netbox-deploy repository
  4. Deploy to netbox-dev200x
  5. Tests
  6. Review UI and note any differences to call out during announcement
  7. (if breaking changes) Port scripts
  8. (if breaking changes) Test scripts
  9. Deploy to production

Preparation

  • Check upstream changelog for any possible breaking changes (usually for major version change only)
    • Update Puppet configuration.py and/or scripts/reports/3rd party scripts accordingly
  • Check the upgrade.sh history and update netbox-deploy:Makefile.deploy accordingly in the dev branch.
  • Check that the plugins (as of today only "netbox-more-metrics") we use are compatible with the target Netbox version
  • Depending on the upgrade being performed, it could be preferable to upgrade all the dependencies (see "Build deploy repository" below) ahead of time to reduce the number of variables during the upgrade itself.

Update WMF Netbox repository

git clone https://github.com/netbox-community/netbox.git
git remote add gerrit ssh://<YOUR_GERRIT_USERNAME>@gerrit.wikimedia.org:29418/operations/software/netbox
git checkout master
git push gerrit master
git push --tags gerrit master

Build deploy repository

Netbox has a deployment repository containing the collected artifacts (the virtual environment and associated libraries) used to deploy it. It is updated separately from our branch of Netbox with the following procedure, which uses the operations/software/netbox-deploy repository.

  1. In a working copy of operations/software/netbox-deploy, update the src/ subdirectory, which is a submodule of this repository pointing at the WMF copy of the Netbox GitHub repository; to do this, git pull in that directory, and then check out the tag of the version being updated to, for example git checkout v4.0.7.
  2. Build the artifacts by doing a make clean and then make all. This uses docker to collect all of the required libraries as specified in the various requirements.txt files. It creates the artifacts as artifacts/artifacts.bookworm.tar.gz and frozen-requirements-bookworm.txt.
  3. Commit the changes to the repository and submit for review, be sure the following files have changes: frozen-requirements-bookworm.txt, artifacts/artifacts.bookworm.tar.gz, src.

Once the repository is reviewed and merged via Gerrit, it is ready to deploy!

Deploy to Testing Server

The next phase, even for simple upgrades, is to deploy to netbox-dev2003.codfw.wmnet for basic testing prior to deploying to production.

  1. Log in to a deployment server such as deploy1002.eqiad.wmnet
  2. Go to /srv/deployment/netbox/deploy; this is a checkout of the -deploy repository from above.
  3. Pull the latest version and make sure you're on the right branch (currently main): git pull origin main
  4. Update the submodule in src with git submodule sync; git submodule update; verify it's at the expected commit with cd src/; git log -1, and check that the /deploy directory doesn't have any outstanding changes with git status
  5. Deploy netbox-dev2003, with bug reference in hand running the cookbook:
    sudo cookbook sre.deploy.python-code -t T12345 -r 'Release v4.0.6 to netbox-next' -u netbox netbox 'A:netbox-canary'
    
  6. This process should go smoothly and leave the target machine ready to test.
  7. Deploy a new production database dump to netbox-dev2003's database to ensure parity with production.

Testing

Even for minor upgrades it's recommended to test as many features and code paths as possible, especially those related to APIs and external tools. Things to look for are obvious errors, but also longer run times, or different data where a comparison is possible.

You can cherry pick the tests you want to run depending on the changelog and the trust you have in Netbox.

Note that not all production features are used on -next, and some behaviors might not be visible there either, for example those related to the active/passive setup.

  • Test login, make sure your username is correct (and not a UUID)
  • Test scripts and reports (Customization->Scripts) and compare to production (they run without error, don't take significantly longer, etc.)
  • Look at some samples of Devices, IPs, VMs, etc. and compare to production
  • Make sure all plugins are showing in the UI
  • Test a manual sync of the Netbox extras repo
  • Review background queues for any stuck or failed jobs
  • Test validators, e.g. by trying to create a device with an invalid asset tag, or an enabled switch interface with a wrong MTU
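One way to compare -next against production in bulk is to diff object counts from the REST API. This is a sketch, not a documented procedure: it relies on the standard Netbox list-endpoint `count` field and token authentication, and the endpoint names passed in are illustrative.

```python
"""Sketch: compare object counts between two Netbox instances."""
import json
import urllib.request


def fetch_count(base_url: str, endpoint: str, token: str) -> int:
    """Return the total object count for a list endpoint (e.g. 'dcim/devices')."""
    req = urllib.request.Request(
        f"{base_url}/api/{endpoint}/?limit=1",
        headers={"Authorization": f"Token {token}", "Accept": "application/json"},
    )
    with urllib.request.urlopen(req) as resp:
        return json.load(resp)["count"]


def diff_counts(prod: dict, next_: dict) -> dict:
    """Return the endpoints whose counts differ, as endpoint -> (prod, next)."""
    return {
        ep: (prod[ep], next_.get(ep))
        for ep in prod
        if next_.get(ep) != prod[ep]
    }
```

For example, collect counts for a handful of endpoints ('dcim/devices', 'ipam/ip-addresses', ...) from both hosts with fetch_count, then feed the two dicts to diff_counts; an empty result means the sampled counts match.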

On netbox-dev2003.codfw.wmnet

  • Run the netbox_housekeeping service and check for proper execution
  • Test the Ganeti VM sync script, for example by:
    • Temporarily adding the profile:codfw_test stanza to /etc/netbox/ganeti-sync.cfg
      [profile:codfw_test]
      site=codfw
      cluster=ganeti-test01.svc.codfw.wmnet
    • Run sudo -u netbox /srv/deployment/netbox/venv/bin/python3 /srv/deployment/netbox-extras/tools/ganeti-netbox-sync.py codfw_test
    • Optionally delete a VM from Netbox-next and verify that it's properly re-added
  • Test the DNS generation script: sudo -u netbox python3 /srv/deployment/netbox-extras/dns/generate_dns_snippets.py -v commit "test"
  • Check the logs for anything unexpected (in syslog and /srv/log/netbox/main.log)
  • Check that the Django/Netbox metrics are properly exposed to Prometheus

On your computer

Run these if you have the various tools checked out locally; they should then be re-run from the cumin hosts once the migration is over.

  • Run Homer with a diff or generate action with your config file pointing to netbox-next. This should produce no diff if the database is updated to production contents
  • As much as realistically possible, run Spicerack and the cookbooks listed on Netbox#External scripts in dry-run while pointing the config to netbox-next.

Production testing

  • -next has some differences from the prod instance, most notably that it runs as a standalone instance (frontend, DB and Redis all on the same host). In production it's necessary to check that job runs are properly dispatched by Redis (in the "Background Queues" page).
  • Force run the various Ganeti sync and reports systemd timers
  • Run the dns cookbook as dry-run
  • Run the Netbox Hiera cookbook as dry run
  • Check Netbox#Prometheus graphs
  • Check overall monitoring (Alertmanager/Icinga)

Porting and Testing

This covers porting for both expected changes (listed in the changelog) and unexpected ones (e.g. found during testing). If possible, write patches in a backward-compatible way and deploy them ahead of the production infra upgrade.

Once porting is thought to be complete, deploy the changes to netbox-dev2003, Spicerack, Homer, etc., and do a full run of testing to verify that the changes fix the problems that turned up in initial testing, including attempting to go down avenues of execution that may not normally be hit.

If it's not possible or convenient to write the patches in a backward-compatible way, the upgrade timeline becomes critical: the new patches will need to be deployed in the same maintenance window as the production upgrade, and possibly reworked if any issue shows up in production.

Deploy to Production

After a final run-through of any problem areas exposed in the above testing, and once fixes are deployed to netbox-dev2003, it is finally time to deploy the new version to production, with the following procedure:

  1. Announce that the release will be occurring on #wikimedia-dcops, #mediawiki_security, #wikimedia-operations, #wikimedia-sre and, if necessary, coordinate a time when integration tools or DC-ops work will not be interrupted.
  2. Merge any outstanding changes to netbox-extras or netbox/deploy repositories (if necessary).
  3. On netboxdb[1-2]2003, perform a manual dump of the database.
  4. Deploy netbox-extras to production using cumin, as in #Netbox_Extras
  5. Announce on IRC that a deploy is happening, on #wikimedia-operations: !log Deploying Netbox v4... to production Tbug
  6. Deploy to production:
    1. Log in to a deployment server
    2. Go to /srv/deployment/netbox/deploy; this is a checkout of the -deploy repository from above.
    3. Pull the latest version, and update the submodule in src by pulling and checking out the tag that is going to be deployed.
    4. Deploy with the cookbook with bug reference in hand: sudo cookbook sre.deploy.python-code -r 'Release v4.xx to production' -u netbox netbox 'A:netbox'
    5. Make sure no issues happened during the deploy (e.g. DB schema migrations)
  7. Perform testing as above, and in general make sure everything is as expected
  8. Announce (and !log) the end of the upgrade

Troubleshooting

It's possible to run scripts and reports through the command line, for example:

python3 manage.py runscript interface_automation.ImportPuppetDB --data '{"device": "ml-cache1003"}' --commit

CablePath matching query does not exist

If you get the following error when saving a cable through a script, after having deleted a cable:

dcim.models.cables.CablePath.DoesNotExist: CablePath matching query does not exist

Make sure you're calling i.refresh_from_db(), where i is each interface the previous cable was attached to.

Datasource sync: AttributeError: 'NoneType' object has no attribute 'sync'

Seen on Netbox 4. This has only happened once and the root cause is unknown; the full stacktrace is in syslog. We should open an upstream issue if it happens again.

  1. Open nbshell
  2. Run the following to find the problematic object; the broken entry will show None in its line.
    asr = AutoSyncRecord.objects.all()
    for idx, a in enumerate(asr):
        print(f'{idx} - {a} - {a.object}')
    
  3. Delete it with asr[XXX].delete() where XXX is its list index.

Known issues / Future improvements

Phabricator project - https://phabricator.wikimedia.org/tag/netbox/

Improve our infrastructure modeling

  • Make more extensive use of Netbox custom fields - task T305126

Improve automation and reduce tech debt

History

  • At Wikimedia it was evaluated in T170144 as a replacement for Racktables.
  • In T199083 the actual migration between the systems took place
  • T266487 - Netbox 2.9 upgrade
  • T288515 - Netbox vs. Nautobot
  • T296452 - documents the large upgrade from 2.10 to 3.2 and the subsequent improvements it brought
  • T314933 - Upgrade Netbox to latest 3.2
  • T336275 - Upgrade Netbox to 4.x
  • T371889 - Upgrade Netbox to 4.1

See also