Talk:Server Lifecycle

From Wikitech

Hey, how about using the discussion page :-P

Since we are committing to doing everything from puppet disable to power off in one shot (not in days but in an hour let's say), no point in waiting for the daily storedconfig cron cleanup job based on decomm.pp. So I removed that, and added some comments later about how we don't rely on that mechanism. We want to get away from that anyways since this file is eventually to be tossed.

Fixed order: we remove from puppet manifests etc before we disable puppet on the host. And we can do any of those removal steps and stop in the middle before we commit to the process leading to power off.

In theory one could write a script that did puppet disable/storedconfigs clean/remove cert/remove salt key and do all that with one command. This would make it impossible to back out if there were issues with icinga refresh, but if we ensured that there was always someone who could look at and fix those issues when this script is run, that would be ok. Note that puppet node clean (available from puppet 2.7.10 on) removes the cert and cleans up exported resources, so we should start using it instead of puppetstoredconfigclean.

Comments/complaints/corrections please... -- ariel (talk) 09:48, 4 November 2013 (UTC)

Introducing Netbox

Netbox will replace Racktables. What follows is the proposal to adapt our current Server Lifecycle to the introduction of Netbox.

Netbox available statuses

These are the currently available statuses in Netbox:

DEVICE_STATUS_OFFLINE   = 0
DEVICE_STATUS_ACTIVE    = 1
DEVICE_STATUS_PLANNED   = 2
DEVICE_STATUS_STAGED    = 3
DEVICE_STATUS_FAILED    = 4
DEVICE_STATUS_INVENTORY = 5
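
For scripts that need to refer to these values by name, the constants above could be wrapped in an enum. This is only a sketch: the class name DeviceStatus is made up for illustration and is not claimed to be part of Netbox's own API.

```python
from enum import IntEnum


class DeviceStatus(IntEnum):
    """Hypothetical wrapper around the status values listed above."""
    OFFLINE = 0
    ACTIVE = 1
    PLANNED = 2
    STAGED = 3
    FAILED = 4
    INVENTORY = 5


# A status can be looked up either by numeric value or by name.
staged = DeviceStatus(3)
inventory = DeviceStatus["INVENTORY"]
print(staged.name, inventory.value)
```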

Lifecycle defined statuses

Lifecycle status   Netbox status
requested          none, not yet in Netbox
spare              PLANNED
staged             STAGED
active             ACTIVE
failed             FAILED
decommissioned     INVENTORY
unracked           OFFLINE
recycled           none, not anymore in Netbox
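
The mapping above could be kept in tooling as a simple lookup table. The sketch below is illustrative only: the names LIFECYCLE_TO_NETBOX, NETBOX_TO_LIFECYCLE and lifecycle_for are hypothetical, and None stands for "not in Netbox".

```python
# Lifecycle status -> Netbox status name, as in the table above.
# None means the device has no Netbox entry at that point.
LIFECYCLE_TO_NETBOX = {
    "requested": None,
    "spare": "PLANNED",
    "staged": "STAGED",
    "active": "ACTIVE",
    "failed": "FAILED",
    "decommissioned": "INVENTORY",
    "unracked": "OFFLINE",
    "recycled": None,
}

# Reverse lookup, skipping the two statuses that have no Netbox entry.
NETBOX_TO_LIFECYCLE = {
    netbox: lifecycle
    for lifecycle, netbox in LIFECYCLE_TO_NETBOX.items()
    if netbox is not None
}


def lifecycle_for(netbox_status):
    """Return the lifecycle status for a Netbox status name (case-insensitive)."""
    return NETBOX_TO_LIFECYCLE[netbox_status.upper()]


print(lifecycle_for("inventory"))
```

The reverse lookup works because each Netbox status is used by exactly one lifecycle status in this proposal.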

Lifecycle transitions

Only the high-level overview is described here; it will be integrated into the page itself.

Diagram of the Lifecycle transitions

requested -> spare

  • DC Ops receives the shipment to the datacenter (not yet racked)
  • DC Ops adds device to Netbox, with status PLANNED

spare -> staged

  • DC Ops racks the device, if not already racked
  • DC Ops performs the initial setup (operating system installed, puppet etc.)
  • DC Ops assigns a rack position in Netbox, and changes status from PLANNED to STAGED

staged -> active

  • service owner performs acceptance tests and (re)provisions their service
  • service owner changes Netbox's status from STAGED to ACTIVE

active -> decommissioned

  • service owner performs actions to remove it from production
  • service owner changes Netbox's status from ACTIVE to INVENTORY

active -> staged

This transition should be used when reimaging the first host of a cluster into a newer OS that will likely need to be tested extensively before putting it back in production. It can also be used on other occasions when a rollback of the STAGED -> ACTIVE transition is needed.

  • service owner performs actions to remove it from production
  • service owner changes Netbox's status from ACTIVE to STAGED

active -> failed

  • service owner performs actions to depool it from production
  • service owner changes Netbox's status from ACTIVE to FAILED

failed -> staged

  • DC Ops fixes the hardware failure
  • DC Ops changes Netbox's status from FAILED to STAGED

decommissioned -> spare

  • DC Ops wipes and powers down the host, renaming it to its WMF Asset Tag. (The network switch port is also disabled.)
  • DC Ops changes Netbox's status from INVENTORY to PLANNED

decommissioned -> staged

This transition is for renames and immediate re-allocations into a different role.

  • DC Ops wipes the host and reimages it into role::spare with the new name
  • DC Ops changes Netbox's status from INVENTORY to STAGED

failed | spare | decommissioned -> unracked

This transition covers decommissioning a failed host that is beyond repair, retiring an old host that was sitting as a spare, and normal decommissioning.

  • DC Ops unracks the device, and removes the row/rack association in Netbox
  • DC Ops changes Netbox's status to OFFLINE
  • DC Ops places device in the data center's storage unit
  • DC Ops adds device to the "Operations tracking" document, in the "unracked decomissioned" sheet [to be automated]
  • DC Ops (eventually) communicates this (e.g. on a quarterly basis) to Finance, to be written off the books

unracked -> recycled

  • DC Ops ships the device off to a recycling company
  • DC Ops adds the device to the "Operations tracking" document, in the "sold-decom servers" sheet and removes it from the "unracked decomissioned" sheet [to be automated]
  • DC Ops communicates the device to Finance as a "sold-off server"
  • DC Ops deletes the device from Netbox entirely (previously: moves to a decom rack in Racktables)
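
The transitions listed in this section can be summarised as a small directed graph, which would let a script reject a status change that the lifecycle does not allow. This is a minimal sketch assuming lifecycle statuses are passed around as lowercase strings; TRANSITIONS, can_transition and check_path are hypothetical names, not existing tooling.

```python
# Allowed lifecycle transitions, copied from the headings in this section.
TRANSITIONS = {
    "requested": {"spare"},
    "spare": {"staged", "unracked"},
    "staged": {"active"},
    "active": {"decommissioned", "staged", "failed"},
    "failed": {"staged", "unracked"},
    "decommissioned": {"spare", "staged", "unracked"},
    "unracked": {"recycled"},
    "recycled": set(),  # terminal: the device is deleted from Netbox
}


def can_transition(current, target):
    """Return True if current -> target is one of the transitions above."""
    return target in TRANSITIONS.get(current, set())


def check_path(path):
    """Return True if every consecutive pair of statuses in path is allowed."""
    return all(can_transition(a, b) for a, b in zip(path, path[1:]))


# A rename goes active -> decommissioned -> staged, skipping spare.
print(check_path(["active", "decommissioned", "staged"]))
# Going straight from active to spare is not a defined transition.
print(can_transition("active", "spare"))
```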

FAQ

  • The spare term is overloaded, as it's used to describe two different points in the lifecycle of a host.
    1. Host offline & wiped with just the management interface connected and online: spare lifecycle status
    2. Host online, has puppet role spare::system: staged lifecycle status and normal hostname. These hosts are either about to be pushed into service, or are in the process of being removed from service.
  • Renames: active -> decommissioned -> staged with the new name, skipping the spare status
  • Decommission of a production host back into the spare pool: it's a rename to its WMF asset tag.
  • Relationship between racked/unracked, power on/off and their Netbox status:
    Netbox status   Racked      Power
    PLANNED         yes or no   off
    STAGED          yes         on
    ACTIVE          yes         on
    FAILED          yes         on or off
    INVENTORY       yes         on
    OFFLINE         no          off
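
The racked/power relationship in the last FAQ item could be turned into a consistency check, for example in a periodic audit. The sketch below assumes the racked state is available as a boolean and the power state as the string "on" or "off"; EXPECTED and is_consistent are hypothetical names, and nothing here queries Netbox itself.

```python
# Netbox status -> (allowed racked values, allowed power states),
# transcribed from the table in the FAQ above.
EXPECTED = {
    "PLANNED": ({True, False}, {"off"}),
    "STAGED": ({True}, {"on"}),
    "ACTIVE": ({True}, {"on"}),
    "FAILED": ({True}, {"on", "off"}),
    "INVENTORY": ({True}, {"on"}),
    "OFFLINE": ({False}, {"off"}),
}


def is_consistent(status, racked, power):
    """Return True if the racked/power combination matches the table."""
    racked_ok, power_ok = EXPECTED[status]
    return racked in racked_ok and power in power_ok


# A PLANNED device may still be unracked, as long as it is powered off.
print(is_consistent("PLANNED", False, "off"))
# An OFFLINE device should no longer be racked.
print(is_consistent("OFFLINE", True, "off"))
```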

Open questions

  • Power on/off status: if needed we could add a custom field to Netbox to track the power status. Not sure it's worth it, though, and it can be added at a later stage too.
  • Should the wipe+reimage process be done on active -> decommissioned by the service owner instead of doing it in the next step?
  • When do we want to start tracking IP and VLAN associations in Netbox?

Changes to be made to the page sections

This is a minimal list of required changes to be made to this wiki page to include the Netbox steps. We can then restructure the page to follow the above lifecycle statuses.

  • Server states
    • Requested: kept as is
    • Existing System Allocation: to be split into the decommissioned -> staged and staged -> active transitions; as it stands, it basically covers only the latter.
    • Ordered: kept as is
    • Post order: kept as is
    • Receiving Systems On-Site: change "insert into Racktables" to "insert into Netbox with status PLANNED". If the hostname is not available at this time, insert the device without a hostname; the asset tag has its own field.
    • Racked: change "update Racktables" to "update Netbox with the hostname (if it was inserted without) and rack location"
    • Installation: if the hostname is chosen at this time, update Netbox with it.
    • In Service: kept as is
    • Reinstallation: add a note for the ACTIVE -> STAGED transition to put the host in Netbox in STAGED if needed (not for normal depool -> reimage -> pool transitions).
    • Reclaim to Spares OR Decommission
      • Steps for ANY Opsen: add change Netbox status from ACTIVE, reimage into spare::system keeping the system's current hostname (unless already re-assigned to a different name/role).
      • Steps for DC-OPS (with network switch access): add change Netbox status from INVENTORY to OFFLINE, wipe disks, rename host to its WMF asset tag in Netbox and on the disabled network switch port.
  • wmf-auto-reimage: kept as is (a CLI option could be added later on to the script to automatically change Netbox status on reimages that need to be put in STAGED).
  • Server reimage + rename: add active -> decommissioned -> staged status changes
  • Position Assignments: kept as is
  • See also: kept as is