Wikimedia Cloud Services team/EnhancementProposals/SpareDisks

Problem Statement

WMCS servers currently do not have hot spare disks added to their RAID configuration.

We depend on Dell's 24-hour support contract to deliver new replacements to us.

Since we standardize on RAID-10, a single disk failure leaves the whole volume without redundancy: it depends on the remaining disk of the degraded pair continuing to work until a replacement arrives, is installed, and the rebuild completes. In practice this means we could be running without redundancy for several days.

Because RAID-10 mirrors write identical data to both SSDs in a pair, their wear levels can be nearly identical. When one fails, the surviving disk is therefore also at elevated risk and could fail at any moment.

Additionally, miscommunication with Dell support or shipping problems could delay the replacement further. We also don't have 24x7 DC-Ops staff to start working on a disk replacement immediately, and there are organizational challenges in monitoring/alerting that could add even more delay.

Proposal

Adopt a new RAID standard where 2 disks are defined as hot spares. They will become active immediately after a failure is detected by the RAID controller, giving all teams more time to react and reducing the time window where we are without redundancy in our RAID volumes.

In essence, we trade storage capacity for a shorter MTTR (mean time to recovery).

Technical Impact

Here is the current situation for our various servers:

Server Vendor Disks Type Disk Size Raw Capacity Current RAID-10 RAID-10 w/ spares Variation Current Disk Usage
labvirt1001 HP 16 15k 300GB 4.8TB 2.4TB 2.1TB -12.5% 1.2TB
labvirt1002 HP 16 15k 300GB 4.8TB 2.4TB 2.1TB -12.5% 1.2TB
labvirt1003 HP 16 15k 300GB 4.8TB 2.4TB 2.1TB -12.5% 0.9TB
labvirt1004 HP 16 15k 300GB 4.8TB 2.4TB 2.1TB -12.5% 1.2TB
labvirt1005 HP 16 15k 300GB 4.8TB 2.4TB 2.1TB -12.5% 0.9TB
labvirt1006 HP 16 15k 300GB 4.8TB 2.4TB 2.1TB -12.5% 0.7TB
labvirt1007 HP 16 15k 300GB 4.8TB 2.4TB 2.1TB -12.5% 1.3TB
labvirt1008 HP 16 15k 300GB 4.8TB 2.4TB 2.1TB -12.5% 0.5TB
cloudvirt1009 HP 16 15k 300GB 4.8TB 2.4TB 2.1TB -12.5% -
cloudvirt1012 HP 6 SSD 1.6TB 9.6TB 4.8TB 3.2TB -33% -
cloudvirt1013 HP 6 SSD 1.6TB 9.6TB 4.8TB 3.2TB -33% 0.3TB
cloudvirt1014 HP 6 SSD 1.6TB 9.6TB 4.8TB 3.2TB -33% 0.5TB
cloudvirt1015 Dell 10 SSD 1.6TB 16TB 8TB 6.4TB -20% -
cloudvirt1016 Dell 10 SSD 1.6TB 16TB 8TB 6.4TB -20% 2.1TB
cloudvirt1017 Dell 10 SSD 1.6TB 16TB 8TB 6.4TB -20% 1.5TB
cloudvirt1018 Dell 10 SSD 1.6TB 16TB 8TB 6.4TB -20% 1.1TB
cloudvirt1019 HP 10 SSD 1.6TB 16TB 8TB 6.4TB -20% 1.5TB
cloudvirt1020 HP 10 SSD 1.6TB 16TB 8TB 6.4TB -20% -
cloudvirt1021 Dell 10 SSD 1.6TB 16TB 8TB 6.4TB -20% 1.3TB
cloudvirt1022 Dell 10 SSD 1.6TB 16TB 8TB 6.4TB -20% 2.8TB
cloudvirt1023 Dell 10 SSD 1.6TB 16TB 8TB 6.4TB -20% 2.4TB
cloudvirt1024 Dell 10 SSD 1.6TB 16TB 8TB 6.4TB -20% 0.2TB
cloudvirt1025 Dell 6 SSD 1.8TB 10.8TB 5.4TB 3.6TB -33% 0.5TB
cloudvirt1026 Dell 6 SSD 1.8TB 10.8TB 5.4TB 3.6TB -33% 1.1TB
cloudvirt1027 Dell 6 SSD 1.8TB 10.8TB 5.4TB 3.6TB -33% 1TB
cloudvirt1028 Dell 6 SSD 1.8TB 10.8TB 5.4TB 3.6TB -33% 1TB
cloudvirt1029 Dell 6 SSD 1.8TB 10.8TB 5.4TB 3.6TB -33% 0.5TB
cloudvirt1030 Dell 6 SSD 1.8TB 10.8TB 5.4TB 3.6TB -33% 1.5TB
cloudvirtan1001 Dell 12 7.2k 4TB 48TB 24TB 20TB -17% -
cloudvirtan1002 Dell 12 7.2k 4TB 48TB 24TB 20TB -17% -
cloudvirtan1003 Dell 12 7.2k 4TB 48TB 24TB 20TB -17% -
cloudvirtan1004 Dell 12 7.2k 4TB 48TB 24TB 20TB -17% -
cloudvirtan1005 Dell 12 7.2k 4TB 48TB 24TB 20TB -17% -
labstore1004 Dell 26 7.2k 2TB 52TB 26TB 24TB -8% -
labstore1005 Dell 26 7.2k 2TB 52TB 26TB 24TB -8% -
labstore1006 HP 12 7.2k 6TB 72TB 36TB 30TB -17% -
labstore1007 HP 12 7.2k 6TB 72TB 36TB 30TB -17% -
cloudstore1008 Dell 12 7.2k 6TB 72TB 36TB 30TB -17% -
cloudstore1009 Dell 12 7.2k 6TB 72TB 36TB 30TB -17% -

Important notes:

  • Nominal disk capacities are used (base-10 sizes as published by the vendor)
  • Disks dedicated to the operating system are ignored
  • cloudvirtan* are "owned" by the Analytics team and may or may not be sized to allow for spares
  • labstore100{4,5,6,7} are scheduled to be decommissioned, with instances on cloudvirt10{19,20} as their replacement
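
For reference, the "RAID-10 w/ spares" and "Variation" columns above follow directly from the layout: usable capacity = (disks - 2 spares) / 2 * disk size, and the variation is simply 2 / disks. A quick back-of-the-envelope check with plain shell arithmetic, using the nominal base-10 sizes noted above:

$ echo "$(( (16 - 2) / 2 * 300 )) GB"    # labvirt1001: 16 x 300GB disks, 2 hot spares (2/16 = -12.5%)
2100 GB
$ echo "$(( (10 - 2) / 2 * 1600 )) GB"   # cloudvirt1015: 10 x 1.6TB disks, 2 hot spares (2/10 = -20%)
6400 GB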

Timeline

If this proposal is accepted, we could adopt two strategies for deployment:

  • Piggyback on our migration to Stretch/Mitaka on the hypervisors and take each reimage opportunity to modify the RAID configuration
  • Schedule downtime for the remaining hypervisors that are already in Stretch/Mitaka so they're drained, reimaged and put back into production

Since draining hypervisors is a very disruptive process, the complete implementation of this proposal would have to take into account how much downtime we are comfortable with.

This is expected to be a year-long goal, if not longer.

Voting

Please add more stakeholders as needed. Vote Yes/No and provide a justification.

Name Vote Comment
Andrew Bogott - -
Arturo Borrero Yes We should increase robustness and resilience of CloudVPS. I know this involves capacity/budget/refresh planning.
Brooke Storm - -
Bryan Davis Yes Support for the piggyback strategy + investigating "converged infrastructure" idea of re-purposing cloudvirt local storage as Ceph storage that is exposed back to the cloudvirts for instance storage.
Giovanni Tirloni Yes Reasons: Engineer time is more expensive than the cost of spare disks. Lack of redundancy is unacceptable in the face of data loss (which has already occurred). We cannot maintain any meaningful SLA with humans in the critical path.

Decision

While some team members didn't cast their vote formally in this document, in a meeting held on Feb 12 the majority agreed that adopting hot spare disks was a good strategy.

We will:

  • Reconfigure RAID arrays to add hot spares at every opportunity we have (reimages, moving hypervisors to the new eqiad1 region, etc.)
  • Not drain existing hypervisors solely to reconfigure RAID, because that process is too time-consuming
  • Give new servers bought for codfw that only serve dev/test purposes 1 hot spare disk instead of the 2 used on production hypervisors
  • Investigate using the hypervisors themselves as Ceph nodes in a "converged infrastructure" approach

RAID configuration

Initial RAID configuration is done using the last two disks as hot spares. As disks fail and get replaced, however, the hot spares will end up in different slots.

Dell servers

Reboot the server and reconfigure the RAID through the controller's configuration UI.

Remember to go into 'Advanced' while creating the volume and select 'Add hot spares' and 'Initialize'.

When selecting the disks for the volume, leave the last 2 disks unselected. Unintuitively, a new window will pop up after the volume is created, asking you to select the spare disks.
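
The reconfiguration itself happens in the controller UI, but if the perccli utility happens to be installed on the host (an assumption; it is not necessarily part of our standard image), the result can be double-checked from Linux afterwards, roughly along these lines:

$ sudo perccli /c0/eall/sall show    # assumes perccli is present; hot spares should be listed with state GHS (global) or DHS (dedicated)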

HP servers (Gen8)

There is an issue when pressing F3 to delete the existing RAID volume. If possible, run the commands from Linux instead.
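
The => prompts below are the interactive shell of the Smart Storage Administrator CLI. Assuming the hpssacli package (or its newer ssacli replacement) is installed on the host, the shell can be started like this, where ctrl all show lists the controllers and confirms it is working:

$ sudo hpssacli
=> ctrl all show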

Show current status:

=> ctrl slot=0 pd all show

Smart Array P420i in Slot 0 (Embedded)

   array A

      physicaldrive 1I:1:1 (port 1I:box 1:bay 1, SAS, 146 GB, OK)
      physicaldrive 1I:1:2 (port 1I:box 1:bay 2, SAS, 146 GB, OK)

   array B

      physicaldrive 1I:1:3 (port 1I:box 1:bay 3, SAS, 300 GB, OK)
      physicaldrive 1I:1:4 (port 1I:box 1:bay 4, SAS, 300 GB, OK)
      physicaldrive 1I:1:5 (port 1I:box 1:bay 5, SAS, 300 GB, OK)
      physicaldrive 1I:1:6 (port 1I:box 1:bay 6, SAS, 300 GB, OK)
      physicaldrive 1I:1:7 (port 1I:box 1:bay 7, SAS, 300 GB, OK)
      physicaldrive 1I:1:8 (port 1I:box 1:bay 8, SAS, 300 GB, OK)
      physicaldrive 1I:1:9 (port 1I:box 1:bay 9, SAS, 300 GB, OK)
      physicaldrive 1I:1:10 (port 1I:box 1:bay 10, SAS, 300 GB, OK)
      physicaldrive 1I:1:11 (port 1I:box 1:bay 11, SAS, 300 GB, OK)
      physicaldrive 1I:1:12 (port 1I:box 1:bay 12, SAS, 300 GB, OK)
      physicaldrive 1I:1:13 (port 1I:box 1:bay 13, SAS, 300 GB, OK)
      physicaldrive 2I:1:14 (port 2I:box 1:bay 14, SAS, 300 GB, OK)
      physicaldrive 2I:1:15 (port 2I:box 1:bay 15, SAS, 300 GB, OK)
      physicaldrive 2I:1:16 (port 2I:box 1:bay 16, SAS, 300 GB, OK)
      physicaldrive 2I:1:17 (port 2I:box 1:bay 17, SAS, 300 GB, OK)
      physicaldrive 2I:1:18 (port 2I:box 1:bay 18, SAS, 300 GB, OK)

Delete existing RAID volume:

=> ctrl slot=0 ld 2 delete forced

Warning: Deleting an array can cause other array letters to become renamed.
         E.g. Deleting array A from arrays A,B,C will result in two remaining
         arrays A,B ... not B,C

Create new RAID volume (leaving 2 disks for spares):

=> ctrl slot=0 create type=ld raid=1+0 drives=1I:1:3,1I:1:4,1I:1:5,1I:1:6,1I:1:7,1I:1:8,1I:1:9,1I:1:10,1I:1:11,1I:1:12,1I:1:13,2I:1:14,2I:1:15,2I:1:16

Add hot spares to the array:

=> ctrl slot=0 array B add spares=2I:1:17,2I:1:18

Show RAID volume:

=> ctrl slot=0 ld 2 show          

Smart Array P420i in Slot 0 (Embedded)

   array B

      Logical Drive: 2
         Size: 1.9 TB
         Fault Tolerance: 1+0
         Heads: 255
         Sectors Per Track: 32
         Cylinders: 65535
         Strip Size: 256 KB
         Full Stripe Size: 1792 KB
         Status: OK
         Caching:  Enabled
         Unique Identifier: 600508B1001C2FD858B7C664FB32BECD
         Disk Name: /dev/sdb 
         Mount Points: None
         Logical Drive Label: 045EEA760014380311954D026CC
         Mirror Group 1:
            physicaldrive 1I:1:3 (port 1I:box 1:bay 3, SAS, 300 GB, OK)
            physicaldrive 1I:1:4 (port 1I:box 1:bay 4, SAS, 300 GB, OK)
            physicaldrive 1I:1:5 (port 1I:box 1:bay 5, SAS, 300 GB, OK)
            physicaldrive 1I:1:6 (port 1I:box 1:bay 6, SAS, 300 GB, OK)
            physicaldrive 1I:1:7 (port 1I:box 1:bay 7, SAS, 300 GB, OK)
            physicaldrive 1I:1:8 (port 1I:box 1:bay 8, SAS, 300 GB, OK)
            physicaldrive 1I:1:9 (port 1I:box 1:bay 9, SAS, 300 GB, OK)
         Mirror Group 2:
            physicaldrive 1I:1:10 (port 1I:box 1:bay 10, SAS, 300 GB, OK)
            physicaldrive 1I:1:11 (port 1I:box 1:bay 11, SAS, 300 GB, OK)
            physicaldrive 1I:1:12 (port 1I:box 1:bay 12, SAS, 300 GB, OK)
            physicaldrive 1I:1:13 (port 1I:box 1:bay 13, SAS, 300 GB, OK)
            physicaldrive 2I:1:14 (port 2I:box 1:bay 14, SAS, 300 GB, OK)
            physicaldrive 2I:1:15 (port 2I:box 1:bay 15, SAS, 300 GB, OK)
            physicaldrive 2I:1:16 (port 2I:box 1:bay 16, SAS, 300 GB, OK)
         Drive Type: Data
         LD Acceleration Method: Controller Cache

Show all disks (confirm there are 2 spares):

=> ctrl slot=0 pd all show

Smart Array P420i in Slot 0 (Embedded)

   array A

      physicaldrive 1I:1:1 (port 1I:box 1:bay 1, SAS, 146 GB, OK)
      physicaldrive 1I:1:2 (port 1I:box 1:bay 2, SAS, 146 GB, OK)

   array B

      physicaldrive 1I:1:3 (port 1I:box 1:bay 3, SAS, 300 GB, OK)
      physicaldrive 1I:1:4 (port 1I:box 1:bay 4, SAS, 300 GB, OK)
      physicaldrive 1I:1:5 (port 1I:box 1:bay 5, SAS, 300 GB, OK)
      physicaldrive 1I:1:6 (port 1I:box 1:bay 6, SAS, 300 GB, OK)
      physicaldrive 1I:1:7 (port 1I:box 1:bay 7, SAS, 300 GB, OK)
      physicaldrive 1I:1:8 (port 1I:box 1:bay 8, SAS, 300 GB, OK)
      physicaldrive 1I:1:9 (port 1I:box 1:bay 9, SAS, 300 GB, OK)
      physicaldrive 1I:1:10 (port 1I:box 1:bay 10, SAS, 300 GB, OK)
      physicaldrive 1I:1:11 (port 1I:box 1:bay 11, SAS, 300 GB, OK)
      physicaldrive 1I:1:12 (port 1I:box 1:bay 12, SAS, 300 GB, OK)
      physicaldrive 1I:1:13 (port 1I:box 1:bay 13, SAS, 300 GB, OK)
      physicaldrive 2I:1:14 (port 2I:box 1:bay 14, SAS, 300 GB, OK)
      physicaldrive 2I:1:15 (port 2I:box 1:bay 15, SAS, 300 GB, OK)
      physicaldrive 2I:1:16 (port 2I:box 1:bay 16, SAS, 300 GB, OK)
      physicaldrive 2I:1:17 (port 2I:box 1:bay 17, SAS, 300 GB, OK, spare)
      physicaldrive 2I:1:18 (port 2I:box 1:bay 18, SAS, 300 GB, OK, spare)

HP servers (Gen9)

Enter the HP RAID configuration:

  • Reboot the server
  • Press ESC+9 to enter the menu
  • Select the RAID controller on slot 1
  • Select the option to open the configuration utility
  • Wait for the "error: no such device: HPEZCD260" message to disappear

Verify current situation:

=> controller slot=1 ld 1 show  

Smart Array P840 in Slot 1

   Array A

      Logical Drive: 1
         Size: 7.3 TB
         Fault Tolerance: 1+0
         Heads: 255
         Sectors Per Track: 32
         Cylinders: 65535
         Strip Size: 256 KB
         Full Stripe Size: 1280 KB
         Status: OK
         MultiDomain Status: OK
         Caching:  Disabled
         Unique Identifier: 600508B1001CB0D3B3EFD3A1715AB007
         Disk Name: /dev/sdd 
         Mount Points: None
         Logical Drive Label: 0110F6E7PDNNF0ARH8015T82C5
         Mirror Group 1:
            physicaldrive 1I:1:5 (port 1I:box 1:bay 5, SATA SSD, 1.6 TB, OK)
            physicaldrive 1I:1:6 (port 1I:box 1:bay 6, SATA SSD, 1.6 TB, OK)
            physicaldrive 1I:1:7 (port 1I:box 1:bay 7, SATA SSD, 1.6 TB, OK)
            physicaldrive 1I:1:8 (port 1I:box 1:bay 8, SATA SSD, 1.6 TB, OK)
            physicaldrive 2I:1:1 (port 2I:box 1:bay 1, SATA SSD, 1.6 TB, OK)
         Mirror Group 2:
            physicaldrive 2I:1:2 (port 2I:box 1:bay 2, SATA SSD, 1.6 TB, OK)
            physicaldrive 2I:1:3 (port 2I:box 1:bay 3, SATA SSD, 1.6 TB, OK)
            physicaldrive 2I:1:4 (port 2I:box 1:bay 4, SATA SSD, 1.6 TB, OK)
            physicaldrive 2I:2:1 (port 2I:box 2:bay 1, SATA SSD, 1.6 TB, OK)
            physicaldrive 2I:2:2 (port 2I:box 2:bay 2, SATA SSD, 1.6 TB, OK)
         Drive Type: Data
         LD Acceleration Method: SSD Smart Path

Delete existing array:

=> ctrl slot=1 array A delete forced

Warning: Deleting the specified device(s) will result in data being lost.
         Continue? (y/n) y


Confirm all disks are now unassigned:

=> ctrl slot=1 pd all show

Smart Array P840 in Slot 1

   Unassigned

      physicaldrive 1I:1:5 (port 1I:box 1:bay 5, SATA SSD, 1.6 TB, OK)
      physicaldrive 1I:1:6 (port 1I:box 1:bay 6, SATA SSD, 1.6 TB, OK)
      physicaldrive 1I:1:7 (port 1I:box 1:bay 7, SATA SSD, 1.6 TB, OK)
      physicaldrive 1I:1:8 (port 1I:box 1:bay 8, SATA SSD, 1.6 TB, OK)
      physicaldrive 2I:1:1 (port 2I:box 1:bay 1, SATA SSD, 1.6 TB, OK)
      physicaldrive 2I:1:2 (port 2I:box 1:bay 2, SATA SSD, 1.6 TB, OK)
      physicaldrive 2I:1:3 (port 2I:box 1:bay 3, SATA SSD, 1.6 TB, OK)
      physicaldrive 2I:1:4 (port 2I:box 1:bay 4, SATA SSD, 1.6 TB, OK)
      physicaldrive 2I:2:1 (port 2I:box 2:bay 1, SATA SSD, 1.6 TB, OK)
      physicaldrive 2I:2:2 (port 2I:box 2:bay 2, SATA SSD, 1.6 TB, OK)

Create a new array / logical drive (leaving two disks to be added as spares later):

=> ctrl slot=1 create type=ld drives=1I:1:5,1I:1:6,1I:1:7,1I:1:8,2I:1:1,2I:1:2,2I:1:3,2I:1:4 raid=1+0 forced

Warning: SSD Over Provisioning Optimization will be performed on the physical
         drives in this array. This process may take a long time and cause this
         application to appear unresponsive. Continue? (y/n)y

Add the last 2 disks as spares:

=> ctrl slot=1 array all add spares=2I:2:1,2I:2:2
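
As with the Gen8 procedure above, confirm the spares were registered; the last two drives should be listed with a "spare" flag:

=> ctrl slot=1 pd all show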

Resources

Add papers, external links, etc. that support the proposal.