Juniper router upgrade
Known issues
Still valid
- Junos 21.4R2-Sx and later will not work with system services ssh root-login deny (the FPC won't come online after the upgrade)
Obsolete
- Junos 21.2R2-Sx has an incompatibility with older (at least 17.x) Junos that prevents VRRP adjacencies from establishing with an MD5 key
- Certain REs have an on-board Intel i40e whose firmware needs to be upgraded before newer Junos can be loaded. See notes at Juniper RE i40e firmware.
- When upgrading from pre-21.2R1 to 21.2R1 or later, no-validate is required
Preparation
- List the interesting new features on the task, based on https://apps.juniper.net/feature-explorer/
- Download the proper image to apt1001:/srv/junos/
- We now only use 64-bit vmhost images
- Choose the version based on the upgrade task and Juniper's recommended release
All the steps below should be done with:
cumin1001:~$ sudo cookbook sre.network.prepare-upgrade <image-filename>.tgz <router-fqdn>
- Make room for the image
request system storage cleanup
- If multi-RE, clean up files on the backup RE:
request system storage cleanup re1
- Save rescue config (just in case)
request system configuration rescue save
- Copy image
file copy "https://apt.wikimedia.org/junos/$filename.tgz" /var/tmp/ routing-instance mgmt_junos
- As a data point, this takes ~1h15 from eqiad to ulsfo
- Check checksum
file checksum md5 /var/tmp/$filename.tgz
- Compare with checksum on Juniper's website
- Validate new image against existing config
request vmhost software validate /var/tmp/$filename.tgz
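The checksum comparison above can also be done on the server where the image was downloaded, before the long copy to the router. The following is a minimal sketch, assuming a local copy of the image and the MD5 value copied from Juniper's website. The function names are hypothetical, and the on-router check with file checksum md5 is still needed after the copy.

```python
import hashlib


def md5_of_file(path, chunk_size=1024 * 1024):
    """Return the hex MD5 digest of a file, reading it in chunks."""
    digest = hashlib.md5()
    with open(path, "rb") as handle:
        # Read fixed-size chunks so a multi-GB image is never held in memory.
        for chunk in iter(lambda: handle.read(chunk_size), b""):
            digest.update(chunk)
    return digest.hexdigest()


def checksum_matches(path, published_md5):
    """Compare the local digest with the published one.

    The published value is normalised (whitespace stripped, lowercased)
    because it is usually pasted from a web page.
    """
    return md5_of_file(path) == published_md5.strip().lower()
```

A mismatch usually means a truncated or corrupted download, in which case the image should be fetched again rather than copied to the router.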
Upgrade
- Check that the console port(s) are working
- Depool site (optional)
- (optional) if codfw, drain mw traffic
sudo cookbook sre.mediawiki.route-traffic primary
- Drain traffic away from router
- NOT TESTED YET - apply GRACEFUL_SHUTDOWN - T320230
set protocols bgp graceful-shutdown sender
- Disable the peers
set protocols bgp group Transit4 shutdown
set protocols bgp group Transit6 shutdown
set protocols bgp group IX4 shutdown
set protocols bgp group IX6 shutdown
- Adjust OSPF metrics
- If eqiad/codfw drain the pfw3 link:
set policy-options policy-statement BGP_fundraising_in term address then local-preference 50
set protocols bgp group fundraising metric-out 500
- Ensure router is not VRRP master
show vrrp summary
set groups vrrp interfaces <*> unit <*> family inet address <*> vrrp-group <*> priority 70
set groups vrrp interfaces <*> unit <*> family inet6 address <*> vrrp-inet6-group <*> priority 70
- Note: if specific priorities are set on individual VRRP groups, the priority also needs to be reduced on those groups.
- Downtime the host in Icinga and AlertManager
sudo cookbook sre.hosts.downtime -r 'router upgrade' -t XXX -H 2 --force 'cr3-ulsfo,cr3-ulsfo IPv6,cr3-ulsfo.mgmt'
- This needs to match the Icinga "hosts"; cr3-ulsfo will match in AlertManager as well.
- NOTE: For devices with multiple REs you will probably find the mgmt hosts in Icinga named like 're0.cr3-esams.mgmt'
- Double check site has been fully drained of traffic before proceeding:
- Check no traffic to LVS at site: https://grafana-rw.wikimedia.org/d/000000343/load-balancers-lvs
- Check Cloudflare DDoS tunnels are disabled for site:
sudo cookbook sre.network.cf status all
- Check LibreNMS graphs for router in question: https://librenms.wikimedia.org/devices/type=network
- Disable BGP sessions to LVS/PyBal load-balancers
deactivate protocols bgp group PyBal
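The host list passed to the downtime cookbook above follows a regular pattern, so it can be built rather than typed by hand. This is a minimal sketch: the function name is hypothetical, and the multi-RE naming ('re0.<router>.mgmt', and by extension 're1.<router>.mgmt') is an assumption extrapolated from the note above, so check the result against the actual Icinga host names.

```python
def downtime_hosts(router, re_count=1):
    """Build the comma-separated Icinga host list for a router downtime.

    router   -- short router name, e.g. "cr3-ulsfo"
    re_count -- number of routing engines (assumption: multi-RE devices
                expose one mgmt host per RE, named reN.<router>.mgmt)
    """
    hosts = [router, f"{router} IPv6"]
    if re_count > 1:
        hosts.extend(f"re{index}.{router}.mgmt" for index in range(re_count))
    else:
        hosts.append(f"{router}.mgmt")
    return ",".join(hosts)


# Single-RE example, matching the cookbook invocation shown earlier.
print(downtime_hosts("cr3-ulsfo"))
```

The returned string is meant to be quoted as a single argument to sre.hosts.downtime, since one of the host names contains a space.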
If multi-RE:
- Remove graceful-switchover
deactivate chassis redundancy graceful-switchover
request system configuration rescue save
(to ensure graceful-switchover is not in the rescue config)
- Install image on backup RE
request vmhost software add /var/tmp/$filename.tgz re1
- Reboot RE1
request vmhost reboot re1
- Once back up (show chassis routing-engine), perform RE switchover (impactful)
request chassis routing-engine master switch
- Once done, repeat previous 3 steps for re0
- Rollback "Remove graceful-switchover"
If single RE:
- Install image on RE
request vmhost software add /var/tmp/$filename.tgz
- Reboot router
request vmhost reboot
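The single-RE and multi-RE sequences above can be written out as an ordered command list, which makes the "repeat previous 3 steps for re0" step explicit. This is only a sketch under stated assumptions: re0 is the master at the start, the function name is hypothetical, and the deactivate/activate lines are configuration-mode statements that still need a commit, unlike the operational request commands.

```python
def upgrade_commands(filename, multi_re=False):
    """Return the ordered Junos commands for the install/reboot phase.

    filename -- image name without the .tgz suffix, as in $filename above
    multi_re -- True for dual-RE routers (assumes re0 starts as master)
    """
    image = f"/var/tmp/{filename}.tgz"
    if not multi_re:
        return [f"request vmhost software add {image}", "request vmhost reboot"]

    commands = [
        # Configuration mode, followed by a commit.
        "deactivate chassis redundancy graceful-switchover",
        # Keep graceful-switchover out of the rescue config.
        "request system configuration rescue save",
    ]
    # Upgrade the backup RE first, switch mastership, then do the other RE.
    for backup in ("re1", "re0"):
        commands += [
            f"request vmhost software add {image} {backup}",
            f"request vmhost reboot {backup}",
            "show chassis routing-engine",  # wait until the RE is back up
            "request chassis routing-engine master switch",  # impactful
        ]
    # Rollback of the first step; configuration mode, followed by a commit.
    commands.append("activate chassis redundancy graceful-switchover")
    return commands


for command in upgrade_commands("<image-filename>", multi_re=True):
    print(command)
```

Because the loop ends with a second switchover, mastership returns to re0 once both REs are upgraded.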
Both single and dual RE:
- Check if router is healthy
show log messages | last
show system alarms
show ospf interface
show ospf3 interface
show bgp summary
- All green in Icinga and LibreNMS
Cleanup
request system storage cleanup
- If multi-RE, clean up files on the backup RE:
request system storage cleanup re1
- Remove Icinga and LibreNMS downtimes
- Rollback "Drain traffic away from router"
- Rollback VRRP change if any
- If eqiad/codfw rollback draining the pfw3 link
- Save rescue config (just in case)
request system configuration rescue save
- On vmhost devices, save the disk snapshot to the backup partition
- For single RE devices:
request vmhost snapshot
- For dual RE devices:
request vmhost snapshot routing-engine both
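The rollback items in this section mostly invert configuration statements applied during the upgrade. The sketch below derives rollback statements for the flag-style changes only (set ... shutdown and deactivate ...). The names are hypothetical, and re-activating the PyBal group is an assumption since it is not listed explicitly above. Value-carrying changes such as the VRRP priority, local-preference and metric-out are left out on purpose, because their rollback may mean restoring a previous value rather than deleting the statement.

```python
# Flag-style statements applied during the upgrade, in the order applied.
APPLIED = [
    "set protocols bgp group Transit4 shutdown",
    "set protocols bgp group Transit6 shutdown",
    "set protocols bgp group IX4 shutdown",
    "set protocols bgp group IX6 shutdown",
    "deactivate protocols bgp group PyBal",
]

# Configuration verbs and the verb that undoes them.
INVERSE_VERBS = {"set": "delete", "deactivate": "activate"}


def rollback_command(command):
    """Return the configuration statement that undoes a flag-style command."""
    verb, _, rest = command.partition(" ")
    if verb not in INVERSE_VERBS or not rest:
        raise ValueError(f"cannot derive a rollback for: {command!r}")
    return f"{INVERSE_VERBS[verb]} {rest}"


def rollback_commands(commands):
    """Undo the commands in reverse order of application."""
    return [rollback_command(command) for command in reversed(commands)]


for line in rollback_commands(APPLIED):
    print(line)
```

Undoing in reverse order brings the PyBal sessions back before the transit and IX peers are re-enabled. Whether that ordering is what you want for a given site is a judgment call.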