Wikimedia Cloud Services team/EnhancementProposals/SpareDisks

Problem Statement

WMCS servers currently do not have hot spare disks added to their RAID configuration.

We depend on Dell's 24-hour support contract to deliver new replacements to us.

Since we standardize on RAID-10, a single disk failure leaves the whole volume without redundancy: it depends on the remaining disk of the degraded pair continuing to work until a replacement arrives, is installed, and the rebuild completes. In practice this means we could be running without redundancy for several days.

Because RAID-10 mirrors write identical data to both SSDs in a pair, their wear levels can be nearly identical. When one fails, the surviving disk is therefore also at elevated risk and could fail at any moment.

Additionally, miscommunication with Dell support or shipping problems could delay the replacement further. We also don't have 24x7 DC-Ops staff to start working on a disk replacement immediately, and there are organizational challenges in monitoring/alerting that could add even more delay.

Proposal

Adopt a new RAID standard where 2 disks are defined as hot spares. They will become active immediately after a failure is detected by the RAID controller, giving all teams more time to react and reducing the time window where we are without redundancy in our RAID volumes.

In essence, we trade storage capacity for a shorter MTTR (mean time to recovery).

Technical Impact

Here is the current situation for our various servers:

Server Vendor Disks Type Disk Size Raw Capacity Current RAID-10 RAID-10 w/ spares Variation Current Disk Usage
labvirt1001 HP 16 15k 300GB 4.8TB 2.4TB 2.1TB -12.5% 1.2TB
labvirt1002 HP 16 15k 300GB 4.8TB 2.4TB 2.1TB -12.5% 1.2TB
labvirt1003 HP 16 15k 300GB 4.8TB 2.4TB 2.1TB -12.5% 0.9TB
labvirt1004 HP 16 15k 300GB 4.8TB 2.4TB 2.1TB -12.5% 1.2TB
labvirt1005 HP 16 15k 300GB 4.8TB 2.4TB 2.1TB -12.5% 0.9TB
labvirt1006 HP 16 15k 300GB 4.8TB 2.4TB 2.1TB -12.5% 0.7TB
labvirt1007 HP 16 15k 300GB 4.8TB 2.4TB 2.1TB -12.5% 1.3TB
labvirt1008 HP 16 15k 300GB 4.8TB 2.4TB 2.1TB -12.5% 0.5TB
cloudvirt1009 HP 16 15k 300GB 4.8TB 2.4TB 2.1TB -12.5% -
cloudvirt1012 HP 6 SSD 1.6TB 9.6TB 4.8TB 3.2TB -33% -
cloudvirt1013 HP 6 SSD 1.6TB 9.6TB 4.8TB 3.2TB -33% 0.3TB
cloudvirt1014 HP 6 SSD 1.6TB 9.6TB 4.8TB 3.2TB -33% 0.5TB
cloudvirt1015 Dell 10 SSD 1.6TB 16TB 8TB 6.4TB -20% -
cloudvirt1016 Dell 10 SSD 1.6TB 16TB 8TB 6.4TB -20% 2.1TB
cloudvirt1017 Dell 10 SSD 1.6TB 16TB 8TB 6.4TB -20% 1.5TB
cloudvirt1018 Dell 10 SSD 1.6TB 16TB 8TB 6.4TB -20% 1.1TB
cloudvirt1019 HP 10 SSD 1.6TB 16TB 8TB 6.4TB -20% 1.5TB
cloudvirt1020 HP 10 SSD 1.6TB 16TB 8TB 6.4TB -20% -
cloudvirt1021 Dell 10 SSD 1.6TB 16TB 8TB 6.4TB -20% 1.3TB
cloudvirt1022 Dell 10 SSD 1.6TB 16TB 8TB 6.4TB -20% 2.8TB
cloudvirt1023 Dell 10 SSD 1.6TB 16TB 8TB 6.4TB -20% 2.4TB
cloudvirt1024 Dell 10 SSD 1.6TB 16TB 8TB 6.4TB -20% 0.2TB
cloudvirt1025 Dell 6 SSD 1.8TB 10.8TB 5.4TB 3.6TB -33% 0.5TB
cloudvirt1026 Dell 6 SSD 1.8TB 10.8TB 5.4TB 3.6TB -33% 1.1TB
cloudvirt1027 Dell 6 SSD 1.8TB 10.8TB 5.4TB 3.6TB -33% 1TB
cloudvirt1028 Dell 6 SSD 1.8TB 10.8TB 5.4TB 3.6TB -33% 1TB
cloudvirt1029 Dell 6 SSD 1.8TB 10.8TB 5.4TB 3.6TB -33% 0.5TB
cloudvirt1030 Dell 6 SSD 1.8TB 10.8TB 5.4TB 3.6TB -33% 1.5TB
cloudvirtan1001 Dell 12 7.2k 4TB 48TB 24TB 20TB -17% -
cloudvirtan1002 Dell 12 7.2k 4TB 48TB 24TB 20TB -17% -
cloudvirtan1003 Dell 12 7.2k 4TB 48TB 24TB 20TB -17% -
cloudvirtan1004 Dell 12 7.2k 4TB 48TB 24TB 20TB -17% -
cloudvirtan1005 Dell 12 7.2k 4TB 48TB 24TB 20TB -17% -
labstore1004 Dell 26 7.2k 2TB 52TB 26TB 24TB -8% -
labstore1005 Dell 26 7.2k 2TB 52TB 26TB 24TB -8% -
labstore1006 HP 12 7.2k 6TB 72TB 36TB 30TB -17% -
labstore1007 HP 12 7.2k 6TB 72TB 36TB 30TB -17% -
cloudstore1008 Dell 12 7.2k 6TB 72TB 36TB 30TB -17% -
cloudstore1009 Dell 12 7.2k 6TB 72TB 36TB 30TB -17% -

Important notes:

  • Nominal disk capacities are used (base-10 sizes as published by the vendor)
  • Disks dedicated to the operating system are ignored
  • cloudvirtan* are "owned" by the Analytics team and may or may not be sized to allow for spares
  • labstore100{4,5,6,7} are scheduled to be decommissioned, with instances on cloudvirt10{19,20} as their replacement
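
For reference, the "RAID-10 w/ spares" and "Variation" columns above follow directly from the layout: usable capacity = (disks - 2 spares) / 2 * disk size, and the variation is simply 2 / disks. A quick back-of-the-envelope check with plain shell arithmetic, using the nominal base-10 sizes noted above:

$ echo "$(( (16 - 2) / 2 * 300 )) GB"    # labvirt1001: 16 x 300GB disks, 2 hot spares (2/16 = -12.5%)
2100 GB
$ echo "$(( (10 - 2) / 2 * 1600 )) GB"   # cloudvirt1015: 10 x 1.6TB disks, 2 hot spares (2/10 = -20%)
6400 GB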

Timeline

If this proposal is accepted, we could adopt two strategies for deployment:

  • Piggyback on our migration to Stretch/Mitaka on the hypervisors and take each reimage opportunity to modify the RAID configuration
  • Schedule downtime for the remaining hypervisors that are already in Stretch/Mitaka so they're drained, reimaged and put back into production

Since draining hypervisors is a very disruptive process, the complete implementation of this proposal would have to take into account how much downtime we are comfortable with.

This is expected to be a year-long goal, if not longer.

Voting

Please add more stakeholders as needed. Vote Yes/No and provide a justification.

Name Vote Comment
Andrew Bogott - -
Arturo Borrero Yes We should increase robustness and resilience of CloudVPS. I know this involves capacity/budget/refresh planning.
Brooke Storm - -
Bryan Davis Yes Support for the piggyback strategy + investigating "converged infrastructure" idea of re-purposing cloudvirt local storage as Ceph storage that is exposed back to the cloudvirts for instance storage.
Giovanni Tirloni Yes Reasons: Engineer time is more expensive than the cost of spare disks. Lack of redundancy is unacceptable in the face of data loss (which has already occurred). We cannot maintain any meaningful SLA with humans in the critical path.

Decision

While some team members didn't cast their vote formally in this document, in a meeting held on Feb 12 the majority agreed that adopting hot spare disks was a good strategy.

We will:

  • Reconfigure RAID arrays to add hot spares at every opportunity we have (reimages, moving hypervisors to the new eqiad1 region, etc.)
  • Not drain existing hypervisors solely to reconfigure RAID, because that process is too time-consuming
  • Give new servers bought for codfw that only serve dev/test purposes 1 hot spare disk instead of the 2 used on production hypervisors
  • Investigate using the hypervisors themselves as Ceph nodes in a "converged infrastructure" approach

RAID configuration

Initial RAID configuration is done using the last two disks as hot spares. As disks fail and get replaced, however, the hot spares will end up in different slots.

Dell servers

Reboot the server and reconfigure the RAID through the controller's configuration UI.

Remember to go into 'Advanced' while creating the volume and select 'Add hot spares' and 'Initialize'.

When selecting the disks for the volume, leave the last 2 disks unselected. Unintuitively, a new window will pop up after the volume is created, asking you to select the spare disks.
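
The reconfiguration itself happens in the controller UI, but if the perccli utility happens to be installed on the host (an assumption; it is not necessarily part of our standard image), the result can be double-checked from Linux afterwards, roughly along these lines:

$ sudo perccli /c0/eall/sall show    # assumes perccli is present; hot spares should be listed with state GHS (global) or DHS (dedicated)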

HP servers (Gen8)

There is an issue when pressing F3 to delete the existing RAID volume. If possible, run the commands from Linux instead.
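
The => prompts below are the interactive shell of the Smart Storage Administrator CLI. Assuming the hpssacli package (or its newer ssacli replacement) is installed on the host, the shell can be started like this, where ctrl all show lists the controllers and confirms it is working:

$ sudo hpssacli
=> ctrl all show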

Show current status:

=> ctrl slot=0 pd all show

Smart Array P420i in Slot 0 (Embedded)

   array A

      physicaldrive 1I:1:1 (port 1I:box 1:bay 1, SAS, 146 GB, OK)
      physicaldrive 1I:1:2 (port 1I:box 1:bay 2, SAS, 146 GB, OK)

   array B

      physicaldrive 1I:1:3 (port 1I:box 1:bay 3, SAS, 300 GB, OK)
      physicaldrive 1I:1:4 (port 1I:box 1:bay 4, SAS, 300 GB, OK)
      physicaldrive 1I:1:5 (port 1I:box 1:bay 5, SAS, 300 GB, OK)
      physicaldrive 1I:1:6 (port 1I:box 1:bay 6, SAS, 300 GB, OK)
      physicaldrive 1I:1:7 (port 1I:box 1:bay 7, SAS, 300 GB, OK)
      physicaldrive 1I:1:8 (port 1I:box 1:bay 8, SAS, 300 GB, OK)
      physicaldrive 1I:1:9 (port 1I:box 1:bay 9, SAS, 300 GB, OK)
      physicaldrive 1I:1:10 (port 1I:box 1:bay 10, SAS, 300 GB, OK)
      physicaldrive 1I:1:11 (port 1I:box 1:bay 11, SAS, 300 GB, OK)
      physicaldrive 1I:1:12 (port 1I:box 1:bay 12, SAS, 300 GB, OK)
      physicaldrive 1I:1:13 (port 1I:box 1:bay 13, SAS, 300 GB, OK)
      physicaldrive 2I:1:14 (port 2I:box 1:bay 14, SAS, 300 GB, OK)
      physicaldrive 2I:1:15 (port 2I:box 1:bay 15, SAS, 300 GB, OK)
      physicaldrive 2I:1:16 (port 2I:box 1:bay 16, SAS, 300 GB, OK)
      physicaldrive 2I:1:17 (port 2I:box 1:bay 17, SAS, 300 GB, OK)
      physicaldrive 2I:1:18 (port 2I:box 1:bay 18, SAS, 300 GB, OK)

Delete existing RAID volume:

=> ctrl slot=0 ld 2 delete forced

Warning: Deleting an array can cause other array letters to become renamed.
         E.g. Deleting array A from arrays A,B,C will result in two remaining
         arrays A,B ... not B,C

Create new RAID volume (leaving 2 disks for spares):

=> ctrl slot=0 create type=ld raid=1+0 drives=1I:1:3,1I:1:4,1I:1:5,1I:1:6,1I:1:7,1I:1:8,1I:1:9,1I:1:10,1I:1:11,1I:1:12,1I:1:13,2I:1:14,2I:1:15,2I:1:16

Add hot spares to the array:

=> ctrl slot=0 array B add spares=2I:1:17,2I:1:18

Show RAID volume:

=> ctrl slot=0 ld 2 show          

Smart Array P420i in Slot 0 (Embedded)

   array B

      Logical Drive: 2
         Size: 1.9 TB
         Fault Tolerance: 1+0
         Heads: 255
         Sectors Per Track: 32
         Cylinders: 65535
         Strip Size: 256 KB
         Full Stripe Size: 1792 KB
         Status: OK
         Caching:  Enabled
         Unique Identifier: 600508B1001C2FD858B7C664FB32BECD
         Disk Name: /dev/sdb 
         Mount Points: None
         Logical Drive Label: 045EEA760014380311954D026CC
         Mirror Group 1:
            physicaldrive 1I:1:3 (port 1I:box 1:bay 3, SAS, 300 GB, OK)
            physicaldrive 1I:1:4 (port 1I:box 1:bay 4, SAS, 300 GB, OK)
            physicaldrive 1I:1:5 (port 1I:box 1:bay 5, SAS, 300 GB, OK)
            physicaldrive 1I:1:6 (port 1I:box 1:bay 6, SAS, 300 GB, OK)
            physicaldrive 1I:1:7 (port 1I:box 1:bay 7, SAS, 300 GB, OK)
            physicaldrive 1I:1:8 (port 1I:box 1:bay 8, SAS, 300 GB, OK)
            physicaldrive 1I:1:9 (port 1I:box 1:bay 9, SAS, 300 GB, OK)
         Mirror Group 2:
            physicaldrive 1I:1:10 (port 1I:box 1:bay 10, SAS, 300 GB, OK)
            physicaldrive 1I:1:11 (port 1I:box 1:bay 11, SAS, 300 GB, OK)
            physicaldrive 1I:1:12 (port 1I:box 1:bay 12, SAS, 300 GB, OK)
            physicaldrive 1I:1:13 (port 1I:box 1:bay 13, SAS, 300 GB, OK)
            physicaldrive 2I:1:14 (port 2I:box 1:bay 14, SAS, 300 GB, OK)
            physicaldrive 2I:1:15 (port 2I:box 1:bay 15, SAS, 300 GB, OK)
            physicaldrive 2I:1:16 (port 2I:box 1:bay 16, SAS, 300 GB, OK)
         Drive Type: Data
         LD Acceleration Method: Controller Cache

Show all disks (confirm there are 2 spares):

=> ctrl slot=0 pd all show

Smart Array P420i in Slot 0 (Embedded)

   array A

      physicaldrive 1I:1:1 (port 1I:box 1:bay 1, SAS, 146 GB, OK)
      physicaldrive 1I:1:2 (port 1I:box 1:bay 2, SAS, 146 GB, OK)

   array B

      physicaldrive 1I:1:3 (port 1I:box 1:bay 3, SAS, 300 GB, OK)
      physicaldrive 1I:1:4 (port 1I:box 1:bay 4, SAS, 300 GB, OK)
      physicaldrive 1I:1:5 (port 1I:box 1:bay 5, SAS, 300 GB, OK)
      physicaldrive 1I:1:6 (port 1I:box 1:bay 6, SAS, 300 GB, OK)
      physicaldrive 1I:1:7 (port 1I:box 1:bay 7, SAS, 300 GB, OK)
      physicaldrive 1I:1:8 (port 1I:box 1:bay 8, SAS, 300 GB, OK)
      physicaldrive 1I:1:9 (port 1I:box 1:bay 9, SAS, 300 GB, OK)
      physicaldrive 1I:1:10 (port 1I:box 1:bay 10, SAS, 300 GB, OK)
      physicaldrive 1I:1:11 (port 1I:box 1:bay 11, SAS, 300 GB, OK)
      physicaldrive 1I:1:12 (port 1I:box 1:bay 12, SAS, 300 GB, OK)
      physicaldrive 1I:1:13 (port 1I:box 1:bay 13, SAS, 300 GB, OK)
      physicaldrive 2I:1:14 (port 2I:box 1:bay 14, SAS, 300 GB, OK)
      physicaldrive 2I:1:15 (port 2I:box 1:bay 15, SAS, 300 GB, OK)
      physicaldrive 2I:1:16 (port 2I:box 1:bay 16, SAS, 300 GB, OK)
      physicaldrive 2I:1:17 (port 2I:box 1:bay 17, SAS, 300 GB, OK, spare)
      physicaldrive 2I:1:18 (port 2I:box 1:bay 18, SAS, 300 GB, OK, spare)

HP servers (Gen9)

Enter the HP RAID configuration:

  • Reboot the server
  • Press ESC+9 to enter the menu
  • Select the RAID controller on slot 1
  • Select the option to open the configuration utility
  • Wait for the "error: no such device: HPEZCD260" message to disappear

Verify current situation:

=> controller slot=1 ld 1 show  

Smart Array P840 in Slot 1

   Array A

      Logical Drive: 1
         Size: 7.3 TB
         Fault Tolerance: 1+0
         Heads: 255
         Sectors Per Track: 32
         Cylinders: 65535
         Strip Size: 256 KB
         Full Stripe Size: 1280 KB
         Status: OK
         MultiDomain Status: OK
         Caching:  Disabled
         Unique Identifier: 600508B1001CB0D3B3EFD3A1715AB007
         Disk Name: /dev/sdd 
         Mount Points: None
         Logical Drive Label: 0110F6E7PDNNF0ARH8015T82C5
         Mirror Group 1:
            physicaldrive 1I:1:5 (port 1I:box 1:bay 5, SATA SSD, 1.6 TB, OK)
            physicaldrive 1I:1:6 (port 1I:box 1:bay 6, SATA SSD, 1.6 TB, OK)
            physicaldrive 1I:1:7 (port 1I:box 1:bay 7, SATA SSD, 1.6 TB, OK)
            physicaldrive 1I:1:8 (port 1I:box 1:bay 8, SATA SSD, 1.6 TB, OK)
            physicaldrive 2I:1:1 (port 2I:box 1:bay 1, SATA SSD, 1.6 TB, OK)
         Mirror Group 2:
            physicaldrive 2I:1:2 (port 2I:box 1:bay 2, SATA SSD, 1.6 TB, OK)
            physicaldrive 2I:1:3 (port 2I:box 1:bay 3, SATA SSD, 1.6 TB, OK)
            physicaldrive 2I:1:4 (port 2I:box 1:bay 4, SATA SSD, 1.6 TB, OK)
            physicaldrive 2I:2:1 (port 2I:box 2:bay 1, SATA SSD, 1.6 TB, OK)
            physicaldrive 2I:2:2 (port 2I:box 2:bay 2, SATA SSD, 1.6 TB, OK)
         Drive Type: Data
         LD Acceleration Method: SSD Smart Path

Delete existing array:

=> ctrl slot=1 array A delete forced

Warning: Deleting the specified device(s) will result in data being lost.
         Continue? (y/n) y


Confirm all disks are now unassigned:

=> ctrl slot=1 pd all show

Smart Array P840 in Slot 1

   Unassigned

      physicaldrive 1I:1:5 (port 1I:box 1:bay 5, SATA SSD, 1.6 TB, OK)
      physicaldrive 1I:1:6 (port 1I:box 1:bay 6, SATA SSD, 1.6 TB, OK)
      physicaldrive 1I:1:7 (port 1I:box 1:bay 7, SATA SSD, 1.6 TB, OK)
      physicaldrive 1I:1:8 (port 1I:box 1:bay 8, SATA SSD, 1.6 TB, OK)
      physicaldrive 2I:1:1 (port 2I:box 1:bay 1, SATA SSD, 1.6 TB, OK)
      physicaldrive 2I:1:2 (port 2I:box 1:bay 2, SATA SSD, 1.6 TB, OK)
      physicaldrive 2I:1:3 (port 2I:box 1:bay 3, SATA SSD, 1.6 TB, OK)
      physicaldrive 2I:1:4 (port 2I:box 1:bay 4, SATA SSD, 1.6 TB, OK)
      physicaldrive 2I:2:1 (port 2I:box 2:bay 1, SATA SSD, 1.6 TB, OK)
      physicaldrive 2I:2:2 (port 2I:box 2:bay 2, SATA SSD, 1.6 TB, OK)

Create a new array / logical drive (leaving two disks to be added as spares later):

=> ctrl slot=1 create type=ld drives=1I:1:5,1I:1:6,1I:1:7,1I:1:8,2I:1:1,2I:1:2,2I:1:3,2I:1:4 raid=1+0 forced

Warning: SSD Over Provisioning Optimization will be performed on the physical
         drives in this array. This process may take a long time and cause this
         application to appear unresponsive. Continue? (y/n)y

Add the last 2 disks as spares:

=> ctrl slot=1 array all add spares=2I:2:1,2I:2:2
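
As with the Gen8 procedure above, confirm the spares were registered; the last two drives should be listed with a "spare" flag:

=> ctrl slot=1 pd all show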

Resources

Add papers, external links, etc. that support the proposal.