Juniper router upgrade

Known issues

Still valid

Junos 21.4R2-Sx and later will not work with system services ssh root-login deny (the FPC won't come online after the upgrade)
- https://supportportal.juniper.net/s/article/Junos-21-4-or-later-Root-login-is-required-for-copying-the-FPC-image-from-the-Junos-VM-to-the-Linux-host-during-upgrade-of-VM-Host-based-platforms?language=en_US

Obsolete

Junos 21.2R2-Sx has an incompatibility with older Junos releases (at least 17.x) that prevents VRRP adjacency from establishing when an MD5 key is used
Certain REs have an on-board Intel i40e NIC whose firmware needs to be upgraded before newer Junos can be loaded. See notes at Juniper RE i40e firmware.
When upgrading from pre-21.2R1 to 21.2R1 or later, no-validate is required

Preparation

On the upgrade task, list the interesting new features, based on https://apps.juniper.net/feature-explorer/
Download the proper image to apt1001:/srv/junos/
- We now only use 64-bit vmhost images
- Pick the version based on the upgrade task and Juniper's recommended release

All the steps below should be done with:
cumin1001:~$ sudo cookbook sre.network.prepare-upgrade <image-filename>.tgz <router-fqdn>

Make room for the image
- request system storage cleanup
- If multi-RE, cleanup files on backup RE: request system storage cleanup re1
Save rescue config (just in case)
- request system configuration rescue save
Copy image
- file copy "https://apt.wikimedia.org/junos/$filename.tgz" /var/tmp/ routing-instance mgmt_junos
- As a data point, this takes ~1h15 from eqiad to ulsfo
Check checksum
- file checksum md5 /var/tmp/$filename.tgz
- Compare with checksum on Juniper's website
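The router-side check above uses file checksum md5. As a minimal sketch, the same comparison can be done on the host holding the downloaded image before it is copied over. The function names md5_of_file and checksum_matches are hypothetical helpers, not part of any cookbook, and the expected value is assumed to be the MD5 string copied from Juniper's website.

```python
import hashlib


def md5_of_file(path, chunk_size=1024 * 1024):
    """Return the hex MD5 digest of a file, reading it in chunks.

    Chunked reads keep memory use flat for multi-GB Junos images.
    """
    digest = hashlib.md5()
    with open(path, "rb") as handle:
        for chunk in iter(lambda: handle.read(chunk_size), b""):
            digest.update(chunk)
    return digest.hexdigest()


def checksum_matches(path, expected_md5):
    """Compare a file's MD5 with the published value.

    The expected string is normalised (whitespace stripped, lowercased)
    so a value pasted from a web page still compares cleanly.
    """
    return md5_of_file(path) == expected_md5.strip().lower()
```

If the two values differ, the image should be downloaded again rather than installed.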
Validate new image against existing config
- request vmhost software validate /var/tmp/$filename.tgz

Upgrade

Check that the console port(s) are working
Depool site (optional)
1. (optional) If codfw, drain mw traffic: sudo cookbook sre.mediawiki.route-traffic primary
Drain traffic away from router
1. NOT TESTED YET - apply GRACEFUL_SHUTDOWN - T320230
  - set protocols bgp graceful-shutdown sender
2. Disable the peers
  - set protocols bgp group Transit4 shutdown
  - set protocols bgp group Transit6 shutdown
  - set protocols bgp group IX4 shutdown
  - set protocols bgp group IX6 shutdown
  - Adjust OSPF metrics
  - If eqiad/codfw, drain the pfw3 link:
    - set policy-options policy-statement BGP_fundraising_in term address then local-preference 50
    - set protocols bgp group fundraising metric-out 500
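The per-group shutdown statements above follow one pattern, so they can be generated rather than typed. The sketch below assumes the four group names listed in this section; drain_commands is a hypothetical helper, and the rollback form simply swaps set for delete, which removes the shutdown statement again. The output is meant to be pasted in configuration mode and committed by hand.

```python
# Peer groups taken from the "Disable the peers" step; adjust per router.
PEER_GROUPS = ["Transit4", "Transit6", "IX4", "IX6"]


def drain_commands(groups=PEER_GROUPS, rollback=False):
    """Build the Junos configuration statements for draining BGP peers.

    With rollback=True the same statements are emitted with "delete",
    which removes the shutdown knob during cleanup.
    """
    verb = "delete" if rollback else "set"
    return [f"{verb} protocols bgp group {group} shutdown" for group in groups]


if __name__ == "__main__":
    for line in drain_commands():
        print(line)
```

Generating both directions from the same list keeps the drain and the later rollback symmetrical.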
Ensure router is not VRRP master
- show vrrp summary
- set groups vrrp interfaces <*> unit <*> family inet address <*> vrrp-group <*> priority 70
- set groups vrrp interfaces <*> unit <*> family inet6 address <*> vrrp-inet6-group <*> priority 70
  - Note: if specific priorities are set on individual VRRP groups, the priority needs to be reduced on those groups as well.
Downtime host in Icinga and Alertmanager
- sudo cookbook sre.hosts.downtime -r 'router upgrade' -t XXX -H 2 --force 'cr3-ulsfo,cr3-ulsfo IPv6,cr3-ulsfo.mgmt'
- The names need to match the Icinga "hosts"; cr3-ulsfo will match in Alertmanager as well.
- NOTE: For devices with multiple REs you will probably find the mgmt hosts in Icinga named like 're0.cr3-esams.mgmt'
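The host list passed to the downtime cookbook follows a naming pattern, sketched below. downtime_hosts is a hypothetical helper. The single-RE names come from the example above; the multi-RE form assumes every RE has an Icinga host named like 're0.cr3-esams.mgmt', which should be checked in Icinga before use.

```python
def downtime_hosts(router, routing_engines=1):
    """Return the comma-separated Icinga host list for one router.

    Assumption: multi-RE devices expose one mgmt host per RE, named
    re<N>.<router>.mgmt; single-RE devices use <router>.mgmt.
    """
    hosts = [router, f"{router} IPv6"]
    if routing_engines > 1:
        hosts += [f"re{index}.{router}.mgmt" for index in range(routing_engines)]
    else:
        hosts.append(f"{router}.mgmt")
    return ",".join(hosts)


if __name__ == "__main__":
    print(downtime_hosts("cr3-ulsfo"))
    print(downtime_hosts("cr3-esams", routing_engines=2))
```

The returned string is what goes in quotes at the end of the sre.hosts.downtime command.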
Double check site has been fully drained of traffic before proceeding:
- Check no traffic to LVS at site: https://grafana-rw.wikimedia.org/d/000000343/load-balancers-lvs
- Check Cloudflare DDoS tunnels are disabled for site: sudo cookbook sre.network.cf status all
- Check LibreNMS graphs for router in question: https://librenms.wikimedia.org/devices/type=network
Disable BGP sessions to LVS/PyBal load-balancers
- deactivate protocols bgp group PyBal

If Multi RE:

Remove graceful-switchover
- deactivate chassis redundancy graceful-switchover
- request system configuration rescue save (to ensure graceful-switchover is not in the rescue config)
Install image on backup RE
- request vmhost software add /var/tmp/$filename.tgz re1
Reboot RE1
- request vmhost reboot re1
Once back up (show chassis routing-engine), perform RE switchover (impactful)
- request chassis routing-engine master switch
Once done, repeat previous 3 steps for re0
Roll back "Remove graceful-switchover"
- activate chassis redundancy graceful-switchover
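The ordering of the multi-RE steps is easy to get wrong, so the sketch below spells the whole sequence out as a list. multi_re_steps is a hypothetical helper that only builds strings; it does not talk to the router. It assumes re1 is the backup RE at the start, and that the rollback of the deactivate statement is the matching activate statement. The deactivate and activate lines are configuration statements and need a commit; the rest are operational commands.

```python
def multi_re_steps(image_path):
    """List the dual-RE upgrade commands in the order described above.

    re1 (backup) is upgraded first, then mastership is switched and the
    same three steps are repeated for re0.
    """
    steps = [
        "deactivate chassis redundancy graceful-switchover",  # config mode, then commit
        "request system configuration rescue save",
    ]
    for routing_engine in ("re1", "re0"):
        steps += [
            f"request vmhost software add {image_path} {routing_engine}",
            f"request vmhost reboot {routing_engine}",
            # Only once "show chassis routing-engine" shows the RE back up.
            "request chassis routing-engine master switch",
        ]
    # Assumed rollback of the first step (config mode, then commit).
    steps.append("activate chassis redundancy graceful-switchover")
    return steps


if __name__ == "__main__":
    for number, step in enumerate(multi_re_steps("/var/tmp/$filename.tgz"), start=1):
        print(f"{number}. {step}")
```

Each switchover is impactful, so the printed list is a checklist to follow manually, not something to run unattended.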

If single RE:

Install image on RE
- request vmhost software add /var/tmp/$filename.tgz
Reboot router
- request vmhost reboot

Both single and dual RE:

Check if router is healthy
- show log messages | last
- show system alarms
- show ospf interface
- show ospf3 interface
- show bgp summary
- All green in Icinga and LibreNMS

Cleanup

- request system storage cleanup
  - If multi-RE, cleanup files on backup RE: request system storage cleanup re1
Remove Icinga and LibreNMS downtimes
Roll back "Drain traffic away from router"
Roll back the VRRP change, if any
If eqiad/codfw, roll back the pfw3 link drain
Save rescue config (just in case)
- request system configuration rescue save
On vmhost devices, save the disk snapshot to the backup partition
- Single RE devices: request vmhost snapshot
- Dual RE devices: request vmhost snapshot routing-engine both