Talk:Server Lifecycle
Hey, how about using the discussion page :-P
Since we are committing to doing everything from puppet disable to power off in one shot (not in days but in an hour let's say), no point in waiting for the daily storedconfig cron cleanup job based on decomm.pp. So I removed that, and added some comments later about how we don't rely on that mechanism. We want to get away from that anyways since this file is eventually to be tossed.
Fixed order: we remove from puppet manifests etc before we disable puppet on the host. And we can do any of those removal steps and stop in the middle before we commit to the process leading to power off.
In theory one could write a script that did puppet disable/storedconfigs clean/remove cert/remove salt key, all with one command. This would make it impossible to back out if there were issues with the icinga refresh, but if we ensured that there was always someone who could look at and fix those issues when this script is run, that would be ok. Note that puppet node clean (available from puppet 2.7.10 on) removes the cert and cleans up exported resources, so we should start using it instead of puppetstoredconfigclean.
Comments/complaints/corrections please... -- ariel (talk) 09:48, 4 November 2013 (UTC)
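As a rough illustration of the single-command idea in the comment above, here is a minimal Python sketch that only builds the ordered list of commands as a dry run, without executing anything. The function names, the "host"/"puppetmaster"/"saltmaster" labels and the example hostname are hypothetical; the commands shown (puppet agent --disable, puppet node clean, salt-key -d) are assumed to be the ones meant, and where each one must run is also an assumption.

```python
from typing import List, Tuple

# (where to run it, command as an argv list); the "where" labels are hypothetical.
Step = Tuple[str, List[str]]


def decommission_plan(fqdn: str) -> List[Step]:
    """Return the ordered steps for decommissioning one host (dry run only)."""
    return [
        # Stop the agent on the host itself so it no longer applies catalogs.
        ("host", ["puppet", "agent", "--disable"]),
        # puppet node clean removes the cert and cleans up exported resources.
        ("puppetmaster", ["puppet", "node", "clean", fqdn]),
        # Delete the minion key without an interactive confirmation prompt.
        ("saltmaster", ["salt-key", "-y", "-d", fqdn]),
    ]


def render(plan: List[Step]) -> List[str]:
    """Format the plan as human-readable lines for review before running."""
    return [f"[{where}] {' '.join(cmd)}" for where, cmd in plan]
```

Printing render(decommission_plan("host1001.example.org")) lets whoever is on hand review the steps before committing to them, which keeps the back-out concern visible.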
Introducing Netbox
Netbox will replace Racktables. What follows is a proposal to adapt our current Server Lifecycle to the introduction of Netbox.
Netbox available statuses
These are the statuses currently available in Netbox:
DEVICE_STATUS_OFFLINE = 0
DEVICE_STATUS_ACTIVE = 1
DEVICE_STATUS_PLANNED = 2
DEVICE_STATUS_STAGED = 3
DEVICE_STATUS_FAILED = 4
DEVICE_STATUS_INVENTORY = 5
Lifecycle defined statuses
Lifecycle status | Netbox status |
---|---|
requested | none, not yet in Netbox |
spare | PLANNED |
staged | STAGED |
active | ACTIVE |
failed | FAILED |
decommissioned | INVENTORY |
unracked | OFFLINE |
recycled | none, not anymore in Netbox |
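The table above can be expressed as a small lookup, which may be handy for tooling. This is only a sketch: the enum and function names are hypothetical, the integer values are the ones listed under "Netbox available statuses", and None stands for "not in Netbox".

```python
from enum import IntEnum
from typing import Dict, Optional


class NetboxStatus(IntEnum):
    """Device status values as listed above (DEVICE_STATUS_*)."""
    OFFLINE = 0
    ACTIVE = 1
    PLANNED = 2
    STAGED = 3
    FAILED = 4
    INVENTORY = 5


# Lifecycle status -> Netbox status; None means the device is not in Netbox.
LIFECYCLE_TO_NETBOX: Dict[str, Optional[NetboxStatus]] = {
    "requested": None,
    "spare": NetboxStatus.PLANNED,
    "staged": NetboxStatus.STAGED,
    "active": NetboxStatus.ACTIVE,
    "failed": NetboxStatus.FAILED,
    "decommissioned": NetboxStatus.INVENTORY,
    "unracked": NetboxStatus.OFFLINE,
    "recycled": None,
}


def netbox_status_for(lifecycle_status: str) -> Optional[NetboxStatus]:
    """Look up the Netbox status for a lifecycle status (KeyError if unknown)."""
    return LIFECYCLE_TO_NETBOX[lifecycle_status]
```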
Lifecycle transitions
Only a high-level overview is described here; it will be integrated into the page itself.
requested -> spare
- DC Ops receives the shipment to the datacenter (not yet racked)
- DC Ops adds device to Netbox, with status PLANNED
spare -> staged
- DC Ops racks the device, if not already racked
- DC Ops performs the initial setup (operating system installed, puppet etc.)
- DC Ops assigns a rack position in Netbox, and changes status from PLANNED to STAGED
staged -> active
- service owner performs acceptance tests and (re)provisions their service
- service owner changes Netbox's status from STAGED to ACTIVE
active -> decommissioned
- service owner performs actions to remove it from production
- service owner changes Netbox's status from ACTIVE to INVENTORY
active -> staged
This transition should be used when reimaging the first host of a cluster into a newer OS that will likely need to be tested extensively before putting it back in production. It can also be used on other occasions when a rollback of the STAGED -> ACTIVE transition is needed.
- service owner performs actions to remove it from production
- service owner changes Netbox's status from ACTIVE to STAGED
active -> failed
- service owner performs actions to depool it from production
- service owner changes Netbox's status from ACTIVE to FAILED
failed -> staged
- DC Ops fixes the hardware failure
- DC Ops changes Netbox's status from FAILED to STAGED
decommissioned -> spare
- DC Ops wipes and powers down the host, renaming it to its WMF Asset Tag. (The network switch port is also disabled.)
- DC Ops changes Netbox's status from INVENTORY to PLANNED
decommissioned -> staged
This transition is for renames and immediate re-allocations into a different role.
- DC Ops wipes the host and reimages it into role::spare with the new name
- DC Ops changes Netbox's status from INVENTORY to STAGED
failed | spare | decommissioned -> unracked
This transition applies when decommissioning a failed host that is beyond repair, an old host that was sitting as a spare, or in a normal decommissioning.
- DC Ops unracks the device, and removes the row/rack association in Netbox
- DC Ops changes Netbox's status to OFFLINE
- DC Ops places device in the data center's storage unit
- DC Ops adds device to the "Operations tracking" document, in the "unracked decomissioned" sheet [to be automated]
- DC Ops (eventually) communicates this (e.g. on a quarterly basis) to Finance, to be written off the books
unracked -> recycled
- DC Ops ships the device off to a recycler company
- DC Ops adds the device to the "Operations tracking" document, in the "sold-decom servers" sheet and removes it from the "unracked decomissioned" sheet [to be automated]
- DC Ops communicates the device to Finance as a "sold-off server"
- DC Ops deletes the device from Netbox entirely (previously: moves to a decom rack in Racktables)
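The transitions listed above form a small state machine. A minimal sketch of how they could be checked in code follows; the names ALLOWED_TRANSITIONS, can_transition and first_invalid_step are hypothetical, and the graph contains only the transitions named in this proposal.

```python
from typing import Dict, List, Optional, Set, Tuple

# Lifecycle status -> statuses it may move to, as listed in this proposal.
ALLOWED_TRANSITIONS: Dict[str, Set[str]] = {
    "requested": {"spare"},
    "spare": {"staged", "unracked"},
    "staged": {"active"},
    "active": {"decommissioned", "staged", "failed"},
    "failed": {"staged", "unracked"},
    "decommissioned": {"spare", "staged", "unracked"},
    "unracked": {"recycled"},
    "recycled": set(),
}


def can_transition(current: str, target: str) -> bool:
    """True if the proposal allows moving from current to target."""
    return target in ALLOWED_TRANSITIONS.get(current, set())


def first_invalid_step(path: List[str]) -> Optional[Tuple[str, str]]:
    """Return the first disallowed (current, target) pair in path, or None."""
    for current, target in zip(path, path[1:]):
        if not can_transition(current, target):
            return (current, target)
    return None
```

For example, a rename goes through ["active", "decommissioned", "staged"], which passes the check, while jumping straight from spare to active does not.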
FAQ
- The spare term is overloaded, as it's used to describe two different points in the lifecycle of a host.
  - Host offline & wiped with just the management interface connected and online: spare lifecycle status
  - Host online, has puppet role spare::system: staged lifecycle status and normal hostname. These hosts are either about to be pushed into service, or are in the process of being removed from service.
- Renames: active -> decommissioned -> staged with the new name, skipping the spare status
- Decommission of a production host back into the spare pool: it's a rename to its WMF asset tag.
- Relationship between racked/unracked, power on/off and their Netbox status:
Netbox status | Racked | Power |
---|---|---|
PLANNED | yes or no | off |
STAGED | yes | on |
ACTIVE | yes | on |
FAILED | yes | on or off |
INVENTORY | yes | on |
OFFLINE | no | off |
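The racked/power table above could also serve as a consistency check against what is physically observed. The following is a sketch only: EXPECTED_STATE and is_consistent are hypothetical names, and power is modelled as the strings "on" and "off".

```python
from typing import Dict, FrozenSet, Tuple

# Netbox status -> (allowed racked values, allowed power values), per the table.
EXPECTED_STATE: Dict[str, Tuple[FrozenSet[bool], FrozenSet[str]]] = {
    "PLANNED": (frozenset({True, False}), frozenset({"off"})),
    "STAGED": (frozenset({True}), frozenset({"on"})),
    "ACTIVE": (frozenset({True}), frozenset({"on"})),
    "FAILED": (frozenset({True}), frozenset({"on", "off"})),
    "INVENTORY": (frozenset({True}), frozenset({"on"})),
    "OFFLINE": (frozenset({False}), frozenset({"off"})),
}


def is_consistent(status: str, racked: bool, power: str) -> bool:
    """True if the observed racked/power state matches the table for status."""
    racked_ok, power_ok = EXPECTED_STATE[status]
    return racked in racked_ok and power in power_ok
```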
Open questions
- Power on/off status: if needed, we could add a custom field to Netbox to track the power status. I'm not sure it's worth it, though, and it can be added at a later stage too.
- Should the wipe+reimage process be done on active -> decommissioned by the service owner instead of doing it in the next step?
- When do we want to start tracking IP and VLAN associations in Netbox?
Changes to be made to the page sections
This is a minimal list of required changes to be made to this wiki page to include the Netbox steps. We can then restructure the page to follow the above lifecycle statuses.
- Server states
- Requested: kept as is
- Existing System Allocation: to be split into the DECOMMISSION -> STAGED and STAGED -> ACTIVE transitions; as it stands, it basically covers only the latter.
- Ordered: kept as is
- Post order: kept as is
- Receiving Systems On-Site: change "insert into Racktables" to "insert into Netbox with status PLANNED". If the hostname is not available at this time, insert it without a hostname; the asset tag has its own field.
- Racked: change "update Racktables" to "update Netbox with the hostname (if it was inserted without) and rack location"
- Installation: if the hostname is chosen at this time, update Netbox with it.
- In Service: kept as is
- Reinstallation: add a note for the ACTIVE -> STAGED transition to put the host in Netbox in STAGED if needed (not for normal depool -> reimage -> pool transitions).
- Reclaim to Spares OR Decommission
  - Steps for ANY Opsen: add change Netbox status from ACTIVE reimage into spare::system keeping the system's current hostname (unless already re-assigned to a different name/role).
  - Steps for DC-OPS (with network switch access): add change Netbox status from INVENTORY to OFFLINE, wipe disks, rename host to WMF asset tag in netbox and on disabled network switch port.
- wmf-auto-reimage: kept as is (a CLI option could be added later on to the script to automatically change Netbox status on reimages that need to be put in STAGED).
- Server reimage + rename: add active -> decommissioned -> staged status changes
- Position Assignments: kept as is
- See also: kept as is
Stages
The documentation here does not reflect reality: the stages in Netbox are not the same as the ones mentioned here. It would be nice if whoever maintains Netbox or these docs could review and sync them.
My other request is to explain the stages a bit more in the initial paragraph. Active is more or less clear, but there are several stages of "inactivity" that aren't 100% clear to me. ---- Jcrespo 14:47, 5 June 2019 (UTC)