Netbox
Netbox is used by Wikimedia as a tool for data center infrastructure management (DCIM) and IP address management (IPAM). It also serves as an integration point for switch and port management, DNS management, and other network operations.
Web UI
- https://netbox.wikimedia.org/
- Login using your Developer account/LDAP/Wikitech credentials (Hiera profile::netbox::cas_group_attribute_mapping). *NOTE*: To log in you must be a member of either the wmf or nda group, as membership in those groups sets is_active=True
- nda group members have partial read-only access - task T302870
- wmf group members have full read-only access
- ops group members have full read-write access
- To request membership in any of those groups, see SRE/LDAP/Groups
- Content-Security-Policy headers are set - task T296356
API
- Endpoint
- From an internal host (eg. everything BUT your laptop or WMCS), you MUST use netbox.discovery.wmnet
- REST API
- Create a token on https://netbox.wikimedia.org/user/api-tokens/, ideally read-only, ideally with an expiration date
- Doc: https://netbox.wikimedia.org/api/docs/
- Python library: https://github.com/netbox-community/pynetbox/
- Note that the REST API is quite slow; make sure to optimize your queries and use pynetbox threading (see the sketch after this list)
- GraphQL
- Spicerack
- See https://doc.wikimedia.org/spicerack/master/api/spicerack.netbox.html for the Netbox support
- It's preferred to use the built-in wrapper functions rather than the pynetbox interface directly, as your cookbook might break if it is not updated when Netbox introduces breaking changes
Staging
- It consists of a single bookworm VM (netbox-dev2003) combining frontend, redis and database
- Reachable on netbox-next.wikimedia.org and netbox-next.discovery.wmnet
- Behind caches, similarly to the prod infrastructure
- Its data comes from a manual dump of production's database
- Reach out to Infrastructure Foundations if you need a fresher database
- Be careful not to leak any of its data
- It is used to test Netbox upgrades, scripts, reports, etc
- This host is active in monitoring (with notifications disabled)
- As such, make sure that all alerts have cleared after your tests
Production infrastructure
The production Netbox infrastructure consists of 4 bullseye VMs (see all Netbox VMs):
- 2 active/passive frontends (netboxXXXX)
- Using the central Redis
- 2 primary/replica PostgreSQL databases (netboxdbXXXX)
By default the active/primary servers are the eqiad ones.
The public endpoint is behind our CDN so the request flow is:
- CDN (using the wildcard *.wikimedia.org as its TLS certificate)
- active frontends
- Apache (using cfssl for its TLS certificate)
- Django app (through uwsgi)
- Active database
Monitoring
Icinga
See all Netbox related Icinga checks: https://icinga.wikimedia.org/cgi-bin/icinga/status.cgi?search_string=netbox
In addition to the regular set of VM checks that are run on all servers, there are Icinga checks that only run on the active servers.
Controlled by the profile::netbox::db::primary and profile::netbox::active_server Hiera keys.
Frontends
Controlled by the profile::netbox::active_server Hiera key:
- Alerting for the Ganeti sync systemd timers (are they running correctly?) - see also Netbox#Ganeti sync
- Alerting for the Netbox reports (is there invalid data in Netbox?) - see also Netbox#Reports
- Alerting for the DNS export automation (are there Uncommitted DNS changes in Netbox?) - See also Netbox#DNS
Databases
The replica has a check for replication delay.
Prometheus
Setup task: https://phabricator.wikimedia.org/T243928
Global health overview (beta): https://grafana.wikimedia.org/d/DvXT6LCnk/
Logstash
https://logstash.wikimedia.org/app/dashboards#/view/AXB84iDRKWrIH1QRIR_j
Failover
Frontends
Using confctl, pool the passive server and depool the previous active one.
confctl --object-type discovery select 'dnsdisc=netbox,name=codfw' set/pooled=true
confctl --object-type discovery select 'dnsdisc=netbox,name=eqiad' set/pooled=false
If the failover is going to last (eg. longer than a server reboot), change the profile::netbox::active_server Hiera key to the backup server. This will ensure the cron/systemd timers as well as the Icinga checks are running.
Note that having the active frontend in a different datacenter than the primary database will result in Netbox being slower.
Databases
If the primary database server needs a short downtime it's recommended to not try a failover and instead have Netbox offline for a short amount of time.
There is currently no documented procedure on how to fail the database over, let alone how to fail back to the former primary.
See also Postgres
Database
Restore
First of all, analyze the Netbox changelog to choose the best way to perform a restore.
The general options are:
- Manually (or via the API) re-play the actions listed in the changelog in reverse order. The changelog entries don't have the full raw data, and some of them might show names instead of the IDs required in the API
- Restore a database dump. This ensures consistency at a given point in time, and could even be used to perform a partial restore using pg_restore.
To restore files from Bacula back to the client, use bconsole on helium and refer to Bacula#Restore_(aka_Panic_mode) for detailed steps.
PostgreSQL
Dumps backups
On the database servers, a Puppetized systemd timer (class postgresql::backup) automatically creates dump files of all local Postgres databases (pg_dumpall) and stores them in /srv/postgres-backup:
- On primary node: daily dumps
- On secondary node: hourly dumps
This path is then backed up by Bacula (see Bacula#Adding a new client)
For more details, the related subtask to setup backups was Phab:T190184, improved in task T262677
Stop the Netbox services before restore
Postgres may prevent us from deleting the current database before the restore if there are active remote connections to it. This is usually not the case but it has been observed on occasion. To minimize the chance of it happening we should stop the Netbox services on the primary Netbox host (as of June 2022 netbox1002) prior to restoring the DB.
First downtime the active netbox server with the downtime cookbook from a cumin host:
sudo cookbook sre.hosts.downtime --minutes 30 -r "Restoring DB from backup on netboxdb1002" -t <task> netbox1002.eqiad.wmnet
Then on the active netbox host itself:
sudo systemctl stop rq-netbox
sudo systemctl stop uwsgi-netbox.service
Restore the DB dump
NOTE: The below instructions are also valid to copy a live DB dump to netbox-next; the PostgreSQL DB runs locally on that host rather than on a dedicated DB VM. Also see the note below about running Puppet afterwards to change the netbox DB user password (which is different on the dev host, so it must be changed if the DB is dumped from the live one).
- Check the dump files on the secondary DB host (as of Dec. 2022 netboxdb2003) in /srv/postgres-backup; if there is any issue with those files, do the same on the primary host. The secondary host performs hourly backups while the primary only daily.
- If the secondary host has a newer backup:
- Copy the dump to the primary DB host (as of Dec. 2022 netboxdb1003); from one of the cumin hosts (cumin1002.eqiad.wmnet, cumin2002.codfw.wmnet), as root, run:
SSH_AUTH_SOCK=/run/keyholder/proxy.sock scp -3 root@netboxdb2003.codfw.wmnet:/srv/postgres-backup/psql-all-dbs-latest.sql.gz root@netboxdb1003.eqiad.wmnet:/srv/
- SSH into the primary DB host (as of Nov. 2024 netboxdb1003)
- Change the permissions of the copied backup to be owned by postgres:postgres
- Take a one-off backup on the primary DB host (as of Dec. 2022 netboxdb1003) right before starting the restore (the .bak suffix is important so it is not auto-evicted):
$ su - postgres
$ /usr/bin/pg_dumpall | /bin/gzip > /srv/postgres-backup/${USER}-DESCRIPTION.psql-all-dbs-$(date +\%Y\%m\%d).sql.gz.bak
- Become postgres user:
sudo -i -u postgres
- Connect to the DB, list and drop the Netbox database:
$ psql
postgres=# \l
...
postgres=# DROP DATABASE netbox;
DROP DATABASE
postgres=#
# NOTE - you may still get a message saying 'database "netbox" is being accessed by other users', which prevents you from dropping the DB. These connections can come from the active Netbox host (running reports triggered by systemd timers) and from the backup DB host. It is probably easiest to wait until these complete and re-try; if it cannot wait, the services/processes connecting from those remote hosts probably need to be stopped. As a last resort 'DROP DATABASE "netbox" WITH(FORCE);' can be used.
- Still as the postgres user, restore the DB with:
$ gunzip < /srv/psql-all-dbs-SOME_DATE.sql.gz | /usr/bin/psql
DEV Host password
If the dump has been restored to the DEV host hosting netbox-next, run Puppet to fix the netbox DB user password at this time. NOTE: It has been noticed recently (Nov 2024) that Puppet is not adjusting the password in some cases. If there are logs such as 'password authentication failed for user "netbox"' you can manually change the netbox user password (the password is available in /etc/netbox/configuration.py):
sudo -i -u postgres
psql netbox
ALTER USER netbox WITH PASSWORD '<password>';
Start Netbox services after a restore
After the DB has been restored we can restart Netbox. If the restore was on the netbox-next host, first run Puppet to fix the DB password. SSH into the Netbox active host (as of June 2022 netbox1003) and execute:
sudo systemctl restart uwsgi-netbox.service
sudo systemctl restart rq-netbox.service
sudo systemctl status uwsgi-netbox.service
sudo systemctl status rq-netbox.service
Then check the logs in /srv/log/netbox/main.log and that netbox.wikimedia.org works properly. Check also the last item in the Netbox changelog section in the UI to ensure the data is correctly loaded.
Sanitizing a database dump
The Netbox database contains a few bits of sensitive information, and if it is going to be used for testing purposes in WMCS it should be sanitized first.
- Create a copy of the main database
createdb netbox-sanitize && pg_dump netbox | psql netbox-sanitize
- Run the below SQL code on the netbox-sanitize database.
- Dump and drop the database:
pg_dump netbox-sanitize > netbox-sanitized.sql; dropdb netbox-sanitize
THE BELOW COMMANDS ARE OUTDATED AND MIGHT NOT COVER EVERYTHING THAT NEEDS TO BE SANITIZED
-- truncate secrets
TRUNCATE secrets_secret CASCADE;
TRUNCATE secrets_sessionkey CASCADE;
TRUNCATE secrets_userkey CASCADE;
-- sanitize dcim_serial
UPDATE dcim_device SET serial = concat('SERIAL', id::TEXT);
-- truncate user table
TRUNCATE auth_user CASCADE;
-- sanitize dcim_interface.mac_address
UPDATE dcim_interface SET mac_address = CONCAT(
LPAD(TO_HEX(FLOOR(random() * 255 + 1) :: INT)::TEXT, 2, '0'), ':',
LPAD(TO_HEX(FLOOR(random() * 255 + 1) :: INT)::TEXT, 2, '0'), ':',
LPAD(TO_HEX(FLOOR(random() * 255 + 1) :: INT)::TEXT, 2, '0'), ':',
LPAD(TO_HEX(FLOOR(random() * 255 + 1) :: INT)::TEXT, 2, '0'), ':',
LPAD(TO_HEX(FLOOR(random() * 255 + 1) :: INT)::TEXT, 2, '0'), ':',
LPAD(TO_HEX(FLOOR(random() * 255 + 1) :: INT)::TEXT, 2, '0')) :: macaddr;
-- sanitize circuits_circuit.cid
UPDATE circuits_circuit SET cid = concat('CIRCUIT', id::TEXT);
Netbox Extras
CustomScripts, reports (merged with scripts in the UI), validators and other associated tools for Netbox are collected in the netbox-extras repository at https://gerrit.wikimedia.org/r/plugins/gitiles/operations/software/netbox-extras/.
As a safeguard, after merging your change, its deployment is not fully automatic.
TL;DR: after merging your change use the sre.netbox.update-extras cookbook.
- For Netbox DEV:
sudo cookbook sre.netbox.update-extras --reason 'a good reason' -a netbox-canary
- For Netbox PROD:
sudo cookbook sre.netbox.update-extras --reason 'a good reason' -a netbox
More details
Since Netbox 4, the repository needs to be deployed in 2 different locations (both on the frontends):
- Through the local checkout of the git repository under /srv/deployment/netbox-extras, for validators, tools (Ganeti sync, DNS sync) and shared common.py scripts code.
- Through Netbox's DataSource module, for scripts and reports, which ultimately copies them to /srv/netbox/customscripts/ after going through its internal DB.
Netbox (and source of truth) principles
- Data automatically synced from the infrastructure should not drive the infrastructure
- It is to be used for information purposes (eg. VM disk space) or as support for original data (eg. server interfaces for cables/IP/dns_name)
- All data manually entered will have entry mistakes
- Use helper scripts, input validation or post entry consistency checks (reports)
- All data manually entered will go stale
- Refrain from adding data that will not drive the infrastructure
Netbox features
Custom Links
WebUI (defined there): https://netbox.wikimedia.org/extras/custom-links/
Doc: https://docs.netbox.dev/en/stable/models/extras/customlink/
Netbox allows setting up custom links to other websites using Jinja2 templating for both the displayed name and the actual link, allowing for quite a lot of flexibility. The current setup (as of August 2024) has the following links:
- Grafana (for all physical devices and VMs)
- Icinga (for all physical devices and VMs)
- AlertManager (for all physical devices and VMs)
- Debmonitor (for all physical devices and VMs)
- Procurement Ticket (only for physical devices that have a ticket that matches either Phabricator or RT)
- Hardware config (for Dell and HP physical devices, pointing to the manufacturer page for warranty information based on their serial number)
- LibreNMS (for Juniper, opengear and sentry devices)
- Puppetboard (for all physical devices and VMs)
Reports
WebUI (reports results): https://netbox.wikimedia.org/extras/scripts/
Doc: https://netboxlabs.com/docs/netbox/en/stable/customization/reports/
Netbox reports are a way of validating data within Netbox.
In summary, reports produce a series of log lines that indicate some status connected to a machine, and may be either error, warning, or success. Log lines with no particular disposition, for information purposes, may also be emitted. Log lines can be tied to any Netbox object for easier reference.
Report Conventions
Scripts and reports called by systemd timers use the local user sre_bot.
Because of limitations to the UI for Netbox reports, certain conventions have emerged:
- Reports should emit one log_error line for each failed item. If the item doesn't exist as a Netbox object, None may be passed in place of the first argument.
- If any log_warning lines are produced, they should be grouped after the loop which produces log_error lines.
- Reports should emit one log_success which contains a summary of successes, as the last log in the report.
- Log messages referring to a single object should be formatted like <verb/condition> <noun/subobject>[: <explanatory extra information>]. Examples:
- malformed asset tag: WNF1212
- missing purchase date
- Summary log messages should be formatted like <count> <verb/condition> <noun/subobject>
- If possible followed with a suggestion on how to fix it (for example what are the proper values)
Report Alert
Most reports that alert flag data integrity mismatches caused by changes in the infrastructure; they act as a secondary check and are the responsibility of DC-ops.
Some (eg. the network report) can have unforeseen consequences on the infrastructure (eg. misconfigurations).
Report | Typical Responsibility | Alerts | Typical Error(s) | Note |
---|---|---|---|---|
Accounting | I/F or DC-ops | ✅ | ||
Cables | DC-ops | ✅ | ||
Coherence | DC-Ops | ✅ | ||
LibreNMS | DC-ops or Netops | ✅ | You can ignore a LibreNMS device by setting its "ignore alert" flag in LibreNMS | |
Management | DC-ops | ✅ | ||
PuppetDB | Whoever changed / reimaged host | ✅ | <device> missing from PuppetDB or <device> missing from Netbox. These occur because the data in PuppetDB does not match the data in Netbox, typically related to missing devices or unexpected devices. Generally these errors fix themselves once the reimage is complete, but the Netbox record for the host may need to be updated for decommissioning and similar operations. | |
Network | DC-ops or Netops | ✅ |
Custom Scripts
WebUI: https://netbox.wikimedia.org/extras/scripts/
Doc: https://docs.netbox.dev/en/stable/customization/custom-scripts/
While Netbox reports are read-only and have a fixed output format, CustomScripts can both write to the database and provide custom output.
In our infrastructure they're used for those two aspects:
- Abstract and automate data entry,
- Import Server Facts - imports host network information from PuppetDB into Netbox
- Move Server - moves server location from one place to another making necessary adjustments
- Provision Server Network - adds server to switch connection
- Offline_device - set a device to offline, removing it from rack and deleting network connections
- Replace_device - used to move all attributes from one device to another when being replaced
- add_secondary_ips - for hosts like Cassandra that require more than 1 IP on its primary interface
- Format and expose data in a way that can be consumed by external tools,
- Capirca
The above scripts should probably be moved to the plugin feature.
When running a script, run it a first time with "Commit changes" unchecked to review the changes that would happen, then a second time with "Commit changes" checked to make the changes permanent.
Extra Errors, Notes and Procedures
Would like to remove interface
This error is produced in the Interface Automation script when cleaning up old interfaces during an import.
Interfaces are considered for removal if they don't appear in the list provided by the data source (generally speaking, PuppetDB); they are then checked if there is an IP address or a cable associated with the interface. If there is one of these the interface is left in place so as to not lose data. It is considered a bug if this happens, so if you see this error in an output feel free to open a ticket against #netbox in Phabricator.
Error removing interface after speed change
This error is produced in the Interface Automation script when cleaning up old interfaces when provisioning a server's network attributes.
Specifically for modular interfaces on Juniper devices, the interface name is determined by the speed of the interface, and the port number. If an old interface exists, say xe-1/0/8, on a modular port and we replace the 10G SFP+ with a 25G SFP28, the name of the interface will change to et-1/0/8. JunOS cannot have both defined so the import script will remove the old (xe-1/0/8) interface in Netbox before adding the new one.
This error will get thrown if the old interface still has a cable connected, or an IP address assigned. This shouldn't normally happen, but if it does the old interface should be manually removed, and cables/IPs cleaned up as necessary. Feel free to ping netops members on IRC if there is any confusion, or open a Phabricator task.
Jobs dispatching
When run, a script (or report) is dispatched as a Redis job. The first frontend to pick it up (through the rq-netbox service) used to be the one executing it.
However, because of a limitation documented in T341843, rq-netbox now only runs on the primary frontend, which needs to be the frontend local to the Redis server.
Still, make sure that all frontends are in sync (the update-extras cookbook does the right thing).
Custom Fields
WebUI (defined there): https://netbox.wikimedia.org/extras/custom-fields/
Doc: https://docs.netbox.dev/en/stable/customization/custom-fields/
Please open a task for the I/F team if you need a new Custom Field.
Data sources
WebUI: https://netbox.wikimedia.org/core/data-sources/
See also the "netbox-extra" section of this document. Only the source named "Netbox extra" is synced with the netbox-extras cookbook.
nbshell
Not a user facing feature, but an admin feature, useful for troubleshooting.
Doc: https://docs.netbox.dev/en/stable/administration/netbox-shell/
The below command will drop you into a Python shell with access to all the Netbox models, similarly to what the CustomScripts use.
sudo -i
cd /srv/deployment/netbox && . venv/bin/activate && cd deploy/src/netbox && python manage.py nbshell
When performing changes it's ideal to make them show up in the Netbox changelog. That can be achieved with something like this:
import uuid
request_id = uuid.uuid4()
user = User.objects.get(username='my_username')
# When modifying an object save also the changes:
device = Device.objects.get(name='hostname')
device.comments = 'some comment'
# See the available choices in:
# https://github.com/netbox-community/netbox/blob/master/netbox/extras/choices.py#L81
log = device.to_objectchange('update') # create/update/delete
log.request_id = request_id
log.user = user
log.save()
device.save()
For a create, add the log entry after the object creation. For a delete add it before the object deletion.
Tags/ConfigContext
Tags are a slippery slope as they are global and don't have a built-in mechanism to prevent typos. ConfigContexts are much more difficult to audit than fields. We've so far managed to not need them.
Therefore, if you think you need one, please open a task for the I/F team to discuss it.
Housekeeping
A systemd timer runs once a day to perform background cleanup of expired data. More details on https://docs.netbox.dev/en/stable/administration/housekeeping/
Validators
Tracked in https://phabricator.wikimedia.org/T310590
Unlike reports which only trigger after the erroneous change was made, validators ensure that the data entered (using the UI or the API) respect our custom ("business") rules.
Reports should only be used when validators are not suitable (eg. using external tool, sequence of changes).
Our convention is to have a single validator file per Netbox model, each of them with a single class, class Main(CustomValidator). Per Netbox requirements, this class MUST have a function named def validate(self, instance).
To "activate" the validator, add the model to the relevant profile::netbox::validators
key (prod or dev).
A few things to keep in mind when working on validators:
- Bugs in the validator code will return an error 500 when the user tries to interact with Netbox, which could make them think that there is something wrong with Netbox itself.
- When editing Netbox through the API, a validator failure will return an error 400
- Validators are run for all modifications, please make sure your code is as lean as possible, with the least amount of dependencies
Testing validators
To test a validator on all the existing objects in Netbox you can follow those steps on a Netbox frontend host:
$ sudo -i -u netbox
$ cd /srv/deployment/netbox && . venv/bin/activate && cd deploy/src/netbox
$ sudo vi test_validator.py
### Paste the validator to test in the file
$ python manage.py nbshell
# [...SNIP...] also removing the >>> prefixes for easy copy-pasting
from test_validator import Main
v = Main()
# Adjust the model based on the object you want to test, for example for IPAddresses:
obj_type = IPAddress
for obj in obj_type.objects.all():
    try:
        v.validate(obj, None)
    except Exception as e:
        print(f"{obj} - {e}")
[Ctrl+d]
$ rm test_validator.py
$ tree __pycache__/
__pycache__/
└── test_validator.cpython-39.pyc
$ rm -rf __pycache__
DON'T FORGET TO REMOVE THE TEST FILE AND THE CACHED ONE
Journaling
Netbox doc: https://netboxlabs.com/docs/netbox/en/stable/features/journaling/
How to use in Netbox scripts:
from extras.models import JournalEntry
JournalEntry.objects.create(assigned_object=my_object,comments='a comment', kind='info')
Where my_object can be any kind of Netbox object (interface, device, circuit, etc), and kind is defined in JournalEntryKindChoices (as of today: 'info' (default if not specified), 'danger', 'success', 'warning'). It's also possible to pass "created_by" a Netbox user, but this is set by default in Netbox scripts.
How to use in pynetbox:
api.extras.journal_entries.create(assigned_object_type='<object_type>',assigned_object_id=<object_id>, comments='a comment', kind='info', created_by=<user_id>)
- Comments and kind are similar to above
- The object can't be passed directly to the create function; it needs to be split into type/id, for example assigned_object_type='dcim.device', assigned_object_id=d.id
- As far as we know, there is no way to programmatically retrieve the object_type from a given object.
- By default, the journal entry will be written as the API key's owner, so usually "sre_bot". It's possible to specify a different user with for example:
created_by=api.users.users.get(username='ayounsi').id
Exports
A set of resources that export Netbox data in various formats.
DNS
A git repository of DNS zonefile snippets generated from Netbox data and exported via HTTPS in read-only mode, to be consumed by the DNS#Authoritative_nameservers and the Continuous Integration tests run for the operations/dns Gerrit repository.
The repository is available via:
$ git clone https://netbox-exports.wikimedia.org/dns.git
To update the repository, see DNS/Netbox#Update_generated_records.
The repository is also mirrored in Phabricator: https://phabricator.wikimedia.org/source/netbox-exported-dns/ though it may not be immediately up-to-date.
Puppet
Some of the information in Netbox is useful in Puppet, for example:
- host rack location
- hosts' management IP address
- hosts status
- network devices
- network prefixes
In order to make this information available to Puppet we have created the sre.puppet.sync-netbox-hiera cookbook, which performs the following actions:
- uses graphQL to retrieve the necessary data
- publishes the information to git via https://netbox-exports.wikimedia.org/netbox-hiera/
- syncs the data to all netbox and cumin hosts
- syncs the data to the puppetmaster hosts
At this point the data is available to Puppet via the Hiera entry for hosts and the common section, and can be looked up with the normal Hiera lookup methods, e.g. to get the host location run lookup('profile::netbox::host::location')
In order to make it easier for users to consume this data we preload it via dedicated profiles. As such, the preferred way to load the data is to include the specific class and then access the data.
include profile::netbox::host
if $profile::netbox::host::location['rack'] == 'D3' {
    fail("${facts['networking']['hostname']} should not be in rack D3")
}
We also store bulk information in Hiera related to the network and to devices not managed by Puppet, e.g. network devices or management interfaces. This data is mostly useful for monitoring, however in the future it may replace the current uses of network::constants. You can load this data as follows; however, please note it is a lot of data and should only be included if needed:
include profile::netbox::data
$profile::netbox::data::mgmt.each |$host, $data| {
    if $data['rack'] == 'D3' {
        fail("${host} should not be in rack D3")
    }
}
$profile::netbox::data::network_devices.each |$host, $data| {
    notice("${host} is a ${data['role']}")
}
$profile::netbox::data::prefixes.each |$prefix, $data| {
    notice("${prefix} (${data['description']}) is in vlan ${data['vlan']}")
}
Prometheus
The netbox-more-metrics plugin adds custom metrics (eg. device statistics) to the main /metrics Prometheus endpoint, which is used to generate https://grafana.wikimedia.org/d/ppq_8SRMk/netbox-device-statistic-breakdown?orgId=1
Imports
Ganeti sync
Refactor and improvements (eg. cluster_group support) in T262446
For each entry under the profile::netbox::ganeti_sync_profiles Hiera key, Puppet creates a systemd timer on the active server to run ganeti-netbox-sync.py with the matching parameters.
Ganeti and Netbox use conflicting and confusing naming, see Ganeti#Netbox naming disambiguation
External scripts
Scripts and tools not previously listed that interact with Netbox, and thus need to be checked for compatibility after significant Netbox changes (eg. upgrades).
Cookbooks with direct pynetbox calls:
- sre.pdus.uptime
- sre.pdus.rotate-snmp
- sre.network.configure-switch-interfaces
- sre.network.peering
- sre.hosts.dhcp
- sre.pdus.rotate-password
- sre.hosts.provision
- sre.hosts.reimage
- sre.pdus.reboot-and-wait
Cookbooks using GraphQL
- sre.puppet.sync-hiera
Pynetbox
Our in-house package of pynetbox is now built using the CI pipeline (Debian packaging with dgit and CI) and hosted at https://gitlab.wikimedia.org/repos/sre/pynetbox
Current fleet wide version status: https://debmonitor.wikimedia.org/packages/python3-pynetbox
We also use pynetbox running in various virtualenvs, for example in Homer or Netbox itself.
Upgrading Netbox
There are 3 types of upgrades:
- Simple upgrade, only Netbox is upgraded
- extremely simple procedure as within patch-level releases they maintain a reasonable level of compatibility in the APIs that we use. When it comes to upgrading across minor versions, breaking changes may have occurred and careful reading of changelogs is needed, as well as testing of the scripts which consume and manipulate data in Netbox
- Infrastructure upgrade, where new servers running a newer Netbox are built in parallel with production and then switched over
- Much more complex, see the Netbox 4 upgrade task for context
- Server refresh, Netbox stays at the same version but the underlying servers are refreshed (eg. new Debian version)
- Never done; probably by failing over to the backup server and failing back onto the new one, or as an infrastructure upgrade
Simple upgrade
- Review changelog and note any changes that may interact with our integrations or deployment
- Update Netbox repository
- Update netbox-deploy repository
- Deploy to netbox-dev200x
- Tests
- Review UI and note any differences to call out during announcement
- (if breaking changes) Port scripts
- (if breaking changes) Test scripts
- Deploy to production
Preparation
- Check upstream changelog for any possible breaking changes (usually for major version change only)
- Update Puppet configuration.py and/or scripts/reports/3rd party scripts accordingly
- Check the upgrade.sh history and update netbox-deploy:Makefile.deploy accordingly in the dev branch.
- Depending on the upgrade being performed, it could be preferable to upgrade all the dependencies (see "Build deploy repository" below) ahead of time to reduce the number of variables during the upgrade itself.
Update WMF Netbox repository
git clone https://github.com/netbox-community/netbox.git
git remote add gerrit ssh://<YOUR_GERRIT_USERNAME>@gerrit.wikimedia.org:29418/operations/software/netbox
git checkout master
git push gerrit master
git push --tags gerrit master
Build deploy repository
Netbox has a deployment repository with the collected artifacts (the virtual environment and associated libraries) that is used to deploy it. This is updated separately from our branch of Netbox with the following procedure, which uses the operations/software/netbox-deploy repository.
- In a working copy of operations/software/netbox-deploy, update the src/ subdirectory, which is a submodule of this repository pointing at the WMF copy of the Netbox GitHub repository; to do this, git pull in that directory, and then check out the tag of the version that is being updated to, for example git checkout v4.0.7.
- Build the artifacts by doing a make clean and then make all. This uses Docker to collect all of the required libraries as specified in the various requirements.txt files. It creates the artifacts as artifacts/artifacts.bookworm.tar.gz and frozen-requirements-bookworm.txt.
- Commit the changes to the repository and submit for review; be sure the following files have changes: frozen-requirements-bookworm.txt, artifacts/artifacts.bookworm.tar.gz, src.
Once the repository is reviewed and merged via Gerrit, it is ready to deploy!
Deploy to Testing Server
The next phase, even for simple upgrades, is to deploy to netbox-dev2003.codfw.wmnet for basic testing prior to deploying to production.
- Login to a deploy server such as deploy1002.eqiad.wmnet
- Go to /srv/deployment/netbox/deploy; this is a checkout of the -deploy repository from above.
- Pull to the latest version and make sure you're on the correct branch (currently main) with git pull origin main
- Update the submodule in src with git submodule sync; git submodule update; and verify it's at the correct commit with cd src/; git log -1
- Check that the /deploy directory doesn't have any outstanding changes with git status
- Deploy to netbox-dev2003, with bug reference in hand, running the cookbook:
sudo cookbook sre.deploy.python-code -t T12345 -r 'Release v4.0.6 to netbox-next' -u netbox netbox 'A:netbox-canary'
- This process should go smoothly and leave the target machine ready to test.
- Deploy a new production database dump to netbox-dev2003's database to ensure parity with production.
Testing
Even for minor upgrades it's recommended to test as many features and code paths as possible, especially those related to APIs and external tools. Things to look for are obviously errors, but also longer run times or different data where possible.
You can cherry pick the tests you want to run depending on the changelog and the trust you have in Netbox.
Note that not all production features are used on -next, and some behaviors might not be visible there either, for example those related to the active/passive setup.
- Test login, make sure your username is correct (and not an uuid)
- Test scripts and reports (Customization->Scripts) and compare to production (they run without error, don't take significantly longer, etc)
- Look at some samples of Devices, IPs, VMs, etc and compare to production
- Make sure any plugin is showing in the UI
- Test a manual sync of the Netbox extras repo
- Review background queues for any stuck or failed one
- Test validators, eg. by trying to create a device with an invalid asset tag, or an enabled switch interface with a wrong MTU
On netbox-dev2003.codfw.wmnet
- Run the netbox_housekeeping service and check for proper execution
- Test the Ganeti VM sync script, for example by:
- Temporarily adding the profile:codfw_test stanza to /etc/netbox/ganeti-sync.cfg:
[profile:codfw_test]
site=codfw
cluster=ganeti-test01.svc.codfw.wmnet
- Run sudo -u netbox /srv/deployment/netbox/venv/bin/python3 /srv/deployment/netbox-extras/tools/ganeti-netbox-sync.py codfw_test
- Optionally delete a VM from Netbox-next and verify that it's properly re-added
- Test the DNS generation script:
sudo -u netbox python3 /srv/deployment/netbox-extras/dns/generate_dns_snippets.py -v commit "test"
- Check the logs for anything unexpected (in syslog and /srv/log/netbox/main.log)
- Check that the Django/Netbox metrics are properly exposed to Prometheus
On your computer
Do these checks on your computer if possible and if you have the various tools checked out locally; they should then be re-run from the cumin hosts once the migration is over.
- Run Homer with a diff or generate action with your config file pointing to netbox-next. This should produce no diff if the database is updated to production contents
- As much as realistically possible, run Spicerack and the cookbooks listed on Netbox#External scripts in dry-run while pointing the config to netbox-next.
Production testing
- -next has some differences from the prod instance, notably that it runs as a standalone instance (frontend, DB and Redis are all on the same host). In production it is also needed to check that job runs are properly dispatched by Redis (in the "Background Queues" page).
- Force run the various Ganeti sync and reports systemd timers
- Run the dns cookbook as dry-run
- Run the Netbox Hiera cookbook as dry run
- Check Netbox#Prometheus graphs
- Check overall monitoring (Alertmanager/Icinga)
Porting and Testing
This covers both expected changes (listed in the changelog) and unexpected ones. If possible, write patches in a backward-compatible way and deploy them ahead of the production infra upgrade.
Once porting is thought to be complete, the changes should be deployed to netbox-dev2003, Spicerack, Homer, etc, and a full run of testing should be done to verify that the changes made fix the problems that turned up in initial testing, including attempting to go down avenues of execution that may not normally be hit.
If it's not possible or convenient to write the patches in a backward-compatible way, the upgrade timeline becomes critical: the new patches will need to be deployed in the same maintenance window as the production upgrade, and possibly re-worked if any issue shows up in production.
Deploy to Production
After a final run-through of any problem areas exposed in the above testing, and once fixes are deployed to netbox-dev2003, it is finally time to deploy the new version to production, with the following procedure:
- Announce that the release will be occurring on
#wikimedia-dcops, #mediawiki_security, #wikimedia-operations, #wikimedia-sre
and, if necessary, coordinate a time when integration tools or DCops work will not be interrupted. - Merge any outstanding changes to
netbox-extras
ornetbox/deploy
repositories (if necessary). - On
netboxdb[1-2]2003
, perform a manual dump of the database. - Deploy
netbox-extras
to production using cumin, as in #Netbox_Extras - Announce on IRC that a deploy is happening, on
#wikimedia-operations
:!log Deploying Netbox v4... to production Tbug
- Deploy to production:
- Login to a Deployment server
- Go to
/srv/deployment/netbox/deploy
; this is a check out of the -deploy repository from above. - Pull to the latest version, and update the submodule in
src
by pulling and checking out the tag that is going to be deployed. - Deploy with the cookbook with bug reference in hand:
sudo cookbook sre.deploy.python-code -r 'Release v4.xx to production' -u netbox netbox 'A:netbox'
- Make sure no issues happened during the deploy (eg. DB migration schema)
- Perform testing as above, and in general make sure everything is as expected
- Announce (and !log) the end of the upgrade
Troubleshooting
It's possible to run scripts and reports through the command line, for example:
python3 manage.py runscript interface_automation.ImportPuppetDB --data '{"device": "ml-cache1003"}' --commit
CablePath matching query does not exist
If getting the following error when saving a cable through a script, after deleting a cable:
dcim.models.cables.CablePath.DoesNotExist: CablePath matching query does not exist
Make sure you're running i.refresh_from_db(), where i is the interface(s) the previous cable was attached to.
Datasource sync: AttributeError: 'NoneType' object has no attribute 'sync'
Netbox 4. This only happened once, root cause unknown. Full stacktrace in syslog. We should open an upstream issue if it happens again.
- Open nbshell
- Run the following to find the problematic object; the broken entry should contain None in its line:
asr = AutoSyncRecord.objects.all()
for idx, a in enumerate(asr):
    print(f'{idx} - {a} - {a.object}')
- Delete it with asr[XXX].delete() where XXX is its list index.
Known issues
Future improvements
Phabricator project - https://phabricator.wikimedia.org/tag/netbox/
Improve our infrastructure modeling
- Make more extensive use of Netbox custom fields - task T305126
Improve automation and reduce tech debt
- Netbox: investigate GraphQL API - task T310577
- Netbox: use Journaling feature - task T310583
- Netbox: basic change rollback - task T310589
- Netbox in codfw slowness issue (path to active/active) - task T330883
History
- At Wikimedia it was evaluated in T170144 as a replacement for Racktables.
- In T199083 the actual migration between the systems took place
- T266487 - Netbox 2.9 upgrade
- T288515 - Netbox vs. Nautobot
- T296452 - documents the large upgrade from 2.10 to 3.2 and the subsequent improvements it brought
- T314933 Upgrade Netbox to latest 3.2
- T336275 - Upgrade Netbox to 4.x
- T371889 - Upgrade Netbox to 4.1
See also
External link
- https://netbox.wikimedia.org (restricted)