Juniper router upgrade

From Wikitech

Known issues

Still valid

Obsolete

  • Junos 21.2R2-Sx is incompatible with older (at least 17.x) Junos releases: VRRP adjacencies fail to establish when an MD5 key is configured.
  • Certain REs have an on-board Intel i40e NIC whose firmware needs to be upgraded before newer Junos can be loaded. See notes at Juniper RE i40e firmware.
  • When upgrading from pre-21.2R1 to 21.2R1 or later, no-validate is required.
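The no-validate rule above can be checked mechanically before planning an upgrade. The following is a minimal sketch, not an existing tool: the function names are hypothetical, and it assumes version strings of the form `<major>.<minor>R<release>` with an optional suffix such as `-S3`, which it ignores.

```python
import re

# Matches the leading "major.minor" and "R<n>" parts of a Junos release
# string such as "21.2R2-S3" or "17.4R3"; any suffix after that is ignored.
_RELEASE_RE = re.compile(r"^(\d+)\.(\d+)R(\d+)")


def parse_release(version: str) -> tuple:
    """Return (major, minor, release) as integers for a Junos version string."""
    match = _RELEASE_RE.match(version.strip())
    if match is None:
        raise ValueError(f"unrecognised Junos version: {version!r}")
    major, minor, release = match.groups()
    return int(major), int(minor), int(release)


def needs_no_validate(current: str, target: str) -> bool:
    """True when going from a pre-21.2R1 release to 21.2R1 or later."""
    threshold = (21, 2, 1)
    return parse_release(current) < threshold <= parse_release(target)


if __name__ == "__main__":
    print(needs_no_validate("20.4R3-S2", "21.2R3-S1"))
```

Tuples compare element by element, so `(20, 4, 3) < (21, 2, 1)` behaves the way release ordering is expected to.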

Preparation

  1. On the task, list the interesting new features, based on https://apps.juniper.net/feature-explorer/
  2. Download the proper image to apt1001:/srv/junos/

All the steps below should be done with:
cumin1001:~$ sudo cookbook sre.network.prepare-upgrade <image-filename>.tgz <router-fqdn>

  1. Make room for the image
    • request system storage cleanup
    • If multi-RE, clean up files on the backup RE: request system storage cleanup re1
  2. Save rescue config (just in case)
    • request system configuration rescue save
  3. Copy image
  4. Check checksum
    • file checksum md5 /var/tmp/$filename.tgz
    • Compare with checksum on Juniper's website
  5. Validate new image against existing config
    • request vmhost software validate /var/tmp/$filename.tgz
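The checksum step can also be done on the host where the image was downloaded, before copying it to the router. A minimal sketch, assuming a local copy of the image and the MD5 value published by Juniper; the function names are hypothetical and this is not part of the prepare-upgrade cookbook.

```python
import hashlib
from pathlib import Path


def md5_of_file(path: Path, chunk_size: int = 1024 * 1024) -> str:
    """Compute the MD5 hex digest of a file, reading it in chunks."""
    digest = hashlib.md5()
    with path.open("rb") as handle:
        # Read fixed-size chunks so large images are never fully in memory.
        for chunk in iter(lambda: handle.read(chunk_size), b""):
            digest.update(chunk)
    return digest.hexdigest()


def checksum_matches(path: Path, expected_md5: str) -> bool:
    """Compare the file's MD5 with the published value, ignoring case and whitespace."""
    return md5_of_file(path) == expected_md5.strip().lower()
```

The result should agree with the output of file checksum md5 on the router once the image has been copied.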

Upgrade

  1. Check that the console port(s) are working
  2. Depool site (optional)
    1. (optional) If codfw, drain MediaWiki traffic: sudo cookbook sre.mediawiki.route-traffic primary
  3. Drain traffic away from router
    1. NOT TESTED YET - apply GRACEFUL_SHUTDOWN - T320230
      • set protocols bgp graceful-shutdown sender
    2. Disable the peers
      • set protocols bgp group Transit4 shutdown
      • set protocols bgp group Transit6 shutdown
      • set protocols bgp group IX4 shutdown
      • set protocols bgp group IX6 shutdown
      • Adjust OSPF metrics
      • If eqiad/codfw drain the pfw3 link:
        • set policy-options policy-statement BGP_fundraising_in term address then local-preference 50
        • set protocols bgp group fundraising metric-out 500
  4. Ensure router is not VRRP master
    • show vrrp summary
    • set groups vrrp interfaces <*> unit <*> family inet address <*> vrrp-group <*> priority 70    
    • set groups vrrp interfaces <*> unit <*> family inet6 address <*> vrrp-inet6-group <*> priority 70
      • Note: if specific priorities are set on individual VRRP groups, the priority also needs to be reduced on those groups.
  5. Downtime the host in Icinga and Alertmanager
    • sudo cookbook sre.hosts.downtime -r 'router upgrade' -t XXX -H 2 --force 'cr3-ulsfo,cr3-ulsfo IPv6,cr3-ulsfo.mgmt'
    • The host list needs to match the Icinga "hosts"; cr3-ulsfo will match in Alertmanager as well.
    • NOTE: For devices with multiple REs you will probably find the mgmt hosts in Icinga named like 're0.cr3-esams.mgmt'
  6. Double check that the site has been fully drained of traffic before proceeding
  7. Disable BGP sessions to LVS/PyBal load-balancers
    • deactivate protocols bgp group PyBal
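The configuration lines for steps 3 and 4 above can be assembled per site. A minimal sketch under stated assumptions: the function name and constants are hypothetical, the group names and priority are the ones listed above, and the OSPF metric adjustment is left out because no command is given for it. The PyBal deactivation is also left out, since it only happens after the downtime and drain checks.

```python
# BGP groups listed in the "Disable the peers" step.
EDGE_BGP_GROUPS = ("Transit4", "Transit6", "IX4", "IX6")
# Sites where the pfw3 link also has to be drained.
SITES_WITH_PFW3 = ("eqiad", "codfw")


def drain_commands(site: str, vrrp_priority: int = 70) -> list:
    """Return the configuration-mode lines for draining a router at `site`."""
    commands = [f"set protocols bgp group {group} shutdown" for group in EDGE_BGP_GROUPS]
    if site in SITES_WITH_PFW3:
        commands += [
            "set policy-options policy-statement BGP_fundraising_in "
            "term address then local-preference 50",
            "set protocols bgp group fundraising metric-out 500",
        ]
    # Lower the VRRP priority so the router stops being VRRP master.
    commands += [
        "set groups vrrp interfaces <*> unit <*> family inet "
        f"address <*> vrrp-group <*> priority {vrrp_priority}",
        "set groups vrrp interfaces <*> unit <*> family inet6 "
        f"address <*> vrrp-inet6-group <*> priority {vrrp_priority}",
    ]
    return commands


if __name__ == "__main__":
    print("\n".join(drain_commands("codfw")))
```

The output is meant to be reviewed and pasted into configuration mode by hand; the sketch does not commit anything.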

If Multi RE:

  1. Remove graceful-switchover
    • deactivate chassis redundancy graceful-switchover
    • request system configuration rescue save (to ensure graceful-switchover is not in the rescue config)
  2. Install image on backup RE
    • request vmhost software add /var/tmp/$filename.tgz re1
  3. Reboot RE1
    • request vmhost reboot re1
  4. Once back up (show chassis routing-engine), perform RE switchover (impactful)
    • request chassis routing-engine master switch
  5. Once done, repeat previous 3 steps for re0
  6. Rollback "Remove graceful-switchover"

If single RE:

  1. Install image on RE
    • request vmhost software add /var/tmp/$filename.tgz
  2. Reboot router
    • request vmhost reboot
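The multi-RE and single-RE sequences above can be written down as one ordered command list. A minimal sketch with several assumptions: the function name is hypothetical, re0 is master when the upgrade starts, "repeat previous 3 steps for re0" is read as install, reboot, and switch back, and the graceful-switchover rollback is taken to be the matching activate statement. The list mixes configuration-mode and operational-mode commands and does not model the wait for the rebooted RE to come back (show chassis routing-engine).

```python
def install_commands(filename: str, multi_re: bool) -> list:
    """Return the ordered commands to install `filename` from /var/tmp."""
    image = f"/var/tmp/{filename}"
    if not multi_re:
        return [f"request vmhost software add {image}", "request vmhost reboot"]

    commands = [
        # Configuration mode: needs a commit before continuing.
        "deactivate chassis redundancy graceful-switchover",
        "request system configuration rescue save",
    ]
    # Upgrade the backup RE first, then switch mastership and do the other one.
    for backup_re in ("re1", "re0"):
        commands += [
            f"request vmhost software add {image} {backup_re}",
            f"request vmhost reboot {backup_re}",
            "request chassis routing-engine master switch",  # impactful
        ]
    # Assumed rollback of the first step (configuration mode).
    commands.append("activate chassis redundancy graceful-switchover")
    return commands
```

Keeping the sequence as data makes it easy to paste into the task for review before the maintenance window.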

Both single and dual RE:

  1. Check if router is healthy
    • show log messages | last
    • show system alarms
    • show ospf interface and show ospf3 interface
    • show bgp summary
    • All green in Icinga and LibreNMS
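The CLI checks above can be collected in one pass for the task log. A minimal sketch: `run` is a hypothetical callable that sends one operational command to the router and returns its output (the transport is deliberately not specified), show ospf(3) interface is read as the two commands show ospf interface and show ospf3 interface, and the "No alarms currently active" marker is an assumption about what show system alarms prints on a healthy router.

```python
from typing import Callable, Dict

HEALTH_COMMANDS = (
    "show log messages | last",
    "show system alarms",
    "show ospf interface",
    "show ospf3 interface",
    "show bgp summary",
)

# Assumed text printed by "show system alarms" when nothing is raised.
NO_ALARMS_MARKER = "No alarms currently active"


def collect_health(run: Callable[[str], str]) -> Dict[str, str]:
    """Run every health command through `run` and return outputs keyed by command."""
    return {command: run(command) for command in HEALTH_COMMANDS}


def alarms_clear(outputs: Dict[str, str]) -> bool:
    """True if the collected alarm output contains the no-alarms marker."""
    return NO_ALARMS_MARKER in outputs.get("show system alarms", "")
```

The OSPF and BGP outputs still need a human look; the sketch only gathers them and flags alarms.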

Cleanup

    • request system storage cleanup
      • If multi-RE, clean up files on the backup RE: request system storage cleanup re1
  1. Remove Icinga and LibreNMS downtimes
  2. Rollback "Drain traffic away from router"
  3. Rollback VRRP change if any
  4. If eqiad/codfw rollback draining the pfw3 link
  5. Save rescue config (just in case)
    • request system configuration rescue save
  6. On vmhost devices, save the disk snapshot to the backup partition
    • Single RE devices: request vmhost snapshot
    • Dual RE devices: request vmhost snapshot routing-engine both
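The router-side commands of the cleanup differ only by RE count, which a small helper can capture. A minimal sketch with a hypothetical function name; it assumes re1 is the backup RE again at this point and covers only the CLI commands, not the downtime removal or the drain rollbacks.

```python
def cleanup_commands(multi_re: bool) -> list:
    """Return the router-side cleanup commands for a single- or dual-RE device."""
    commands = ["request system storage cleanup"]
    if multi_re:
        # Also clean up the backup RE.
        commands.append("request system storage cleanup re1")
    commands.append("request system configuration rescue save")
    if multi_re:
        commands.append("request vmhost snapshot routing-engine both")
    else:
        commands.append("request vmhost snapshot")
    return commands
```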