Wikimedia Cloud Services team/EnhancementProposals/SpareDisks
Problem Statement
WMCS servers currently do not have hot spare disks in their RAID configurations.
We depend on Dell's 24-hour support contract to deliver replacement disks to us.
Since we standardize on RAID-10, a single disk failure leaves the affected mirror pair with no redundancy: the whole volume depends on the surviving disk of that pair continuing to work until a replacement disk arrives, is actually installed, and completes the rebuild. This means we could be running without redundancy for several days.
Because RAID-10 mirrors the same writes to both SSDs in a pair, their wear levels can be nearly identical. When one fails, the surviving disk is therefore also at high risk and could fail at any moment.
Additionally, miscommunication with Dell support or shipping problems could add further delays. We also don't have 24x7 DC-Ops staff to start working on a disk replacement immediately, and there are organizational challenges in monitoring/alerting that could add even more delay.
Proposal
Adopt a new RAID standard where 2 disks are defined as hot spares. They will become active immediately after a failure is detected by the RAID controller, giving all teams more time to react and reducing the time window where we are without redundancy in our RAID volumes.
In essence, we trade storage capacity for a faster MTTR (mean time to recovery).
Technical Impact
Here is the current situation for our various servers (a worked example showing how the capacity columns are derived follows the notes below):
Server | Vendor | Disks | Type | Raw Capacity | Current RAID-10 | RAID-10 w/ spares | Variation | Current Disk Usage |
---|---|---|---|---|---|---|---|---|
labvirt1001 | HP | 16 | 15k 300GB | 4.8TB | 2.4TB | 2.1TB | -12.5% | 1.2TB |
labvirt1002 | HP | 16 | 15k 300GB | 4.8TB | 2.4TB | 2.1TB | -12.5% | 1.2TB |
labvirt1003 | HP | 16 | 15k 300GB | 4.8TB | 2.4TB | 2.1TB | -12.5% | 0.9TB |
labvirt1004 | HP | 16 | 15k 300GB | 4.8TB | 2.4TB | 2.1TB | -12.5% | 1.2TB |
labvirt1005 | HP | 16 | 15k 300GB | 4.8TB | 2.4TB | 2.1TB | -12.5% | 0.9TB |
labvirt1006 | HP | 16 | 15k 300GB | 4.8TB | 2.4TB | 2.1TB | -12.5% | 0.7TB |
labvirt1007 | HP | 16 | 15k 300GB | 4.8TB | 2.4TB | 2.1TB | -12.5% | 1.3TB |
labvirt1008 | HP | 16 | 15k 300GB | 4.8TB | 2.4TB | 2.1TB | -12.5% | 0.5TB |
cloudvirt1009 | HP | 16 | 15k 300GB | 4.8TB | 2.4TB | 2.1TB | -12.5% | - |
cloudvirt1012 | HP | 6 | SSD 1.6TB | 9.6TB | 4.8TB | 3.2TB | -33% | - |
cloudvirt1013 | HP | 6 | SSD 1.6TB | 9.6TB | 4.8TB | 3.2TB | -33% | 0.3TB |
cloudvirt1014 | HP | 6 | SSD 1.6TB | 9.6TB | 4.8TB | 3.2TB | -33% | 0.5TB |
cloudvirt1015 | Dell | 10 | SSD 1.6TB | 16TB | 8TB | 6.4TB | -20% | - |
cloudvirt1016 | Dell | 10 | SSD 1.6TB | 16TB | 8TB | 6.4TB | -20% | 2.1TB |
cloudvirt1017 | Dell | 10 | SSD 1.6TB | 16TB | 8TB | 6.4TB | -20% | 1.5TB |
cloudvirt1018 | Dell | 10 | SSD 1.6TB | 16TB | 8TB | 6.4TB | -20% | 1.1TB |
cloudvirt1019 | HP | 10 | SSD 1.6TB | 16TB | 8TB | 6.4TB | -20% | 1.5TB |
cloudvirt1020 | HP | 10 | SSD 1.6TB | 16TB | 8TB | 6.4TB | -20% | - |
cloudvirt1021 | Dell | 10 | SSD 1.6TB | 16TB | 8TB | 6.4TB | -20% | 1.3TB |
cloudvirt1022 | Dell | 10 | SSD 1.6TB | 16TB | 8TB | 6.4TB | -20% | 2.8TB |
cloudvirt1023 | Dell | 10 | SSD 1.6TB | 16TB | 8TB | 6.4TB | -20% | 2.4TB |
cloudvirt1024 | Dell | 10 | SSD 1.6TB | 16TB | 8TB | 6.4TB | -20% | 0.2TB |
cloudvirt1025 | Dell | 6 | SSD 1.8TB | 10.8TB | 5.4TB | 3.6TB | -33% | 0.5TB |
cloudvirt1026 | Dell | 6 | SSD 1.8TB | 10.8TB | 5.4TB | 3.6TB | -33% | 1.1TB |
cloudvirt1027 | Dell | 6 | SSD 1.8TB | 10.8TB | 5.4TB | 3.6TB | -33% | 1TB |
cloudvirt1028 | Dell | 6 | SSD 1.8TB | 10.8TB | 5.4TB | 3.6TB | -33% | 1TB |
cloudvirt1029 | Dell | 6 | SSD 1.8TB | 10.8TB | 5.4TB | 3.6TB | -33% | 0.5TB |
cloudvirt1030 | Dell | 6 | SSD 1.8TB | 10.8TB | 5.4TB | 3.6TB | -33% | 1.5TB |
cloudvirtan1001 | Dell | 12 | 7.2k 4TB | 48TB | 24TB | 20TB | -17% | - |
cloudvirtan1002 | Dell | 12 | 7.2k 4TB | 48TB | 24TB | 20TB | -17% | - |
cloudvirtan1003 | Dell | 12 | 7.2k 4TB | 48TB | 24TB | 20TB | -17% | - |
cloudvirtan1004 | Dell | 12 | 7.2k 4TB | 48TB | 24TB | 20TB | -17% | - |
cloudvirtan1005 | Dell | 12 | 7.2k 4TB | 48TB | 24TB | 20TB | -17% | - |
labstore1004 | Dell | 26 | 7.2k 2TB | 52TB | 26TB | 24TB | -8% | - |
labstore1005 | Dell | 26 | 7.2k 2TB | 52TB | 26TB | 24TB | -8% | - |
labstore1006 | HP | 12 | 7.2k 6TB | 72TB | 36TB | 30TB | -17% | - |
labstore1007 | HP | 12 | 7.2k 6TB | 72TB | 36TB | 30TB | -17% | - |
cloudstore1008 | Dell | 12 | 7.2k 6TB | 72TB | 36TB | 30TB | -17% | - |
cloudstore1009 | Dell | 12 | 7.2k 6TB | 72TB | 36TB | 30TB | -17% | - |
Important notes:
- Nominal disk capacities are used (base-10 sizes as published by the vendor)
- Disks dedicated to the operating system are excluded
- cloudvirtan* are "owned" by the Analytics team and may or may not be sized to allow for spares
- labstore100{4,5,6,7} are scheduled to be decommissioned with instances on cloudvirt10{19,20} as the replacement
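The "RAID-10 w/ spares" and "Variation" columns follow from a simple calculation: RAID-10 provides half of the raw capacity, and the two hot spares are subtracted from the disk count before halving. A minimal worked example for cloudvirt1015 (10 x 1.6TB SSDs, figures taken from the table above):

$ awk 'BEGIN {
    disks = 10; size_tb = 1.6; spares = 2            # cloudvirt1015, per the table above
    current     = disks * size_tb / 2                # RAID-10 usable capacity = raw / 2
    with_spares = (disks - spares) * size_tb / 2     # hot spares are excluded from the volume
    printf "current: %.1f TB, with spares: %.1f TB, variation: %.0f%%\n", current, with_spares, 100 * (with_spares - current) / current
}'
current: 8.0 TB, with spares: 6.4 TB, variation: -20%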
Timeline
If this proposal is accepted, we could adopt two strategies for deployment:
- Piggyback on our migration to Stretch/Mitaka on the hypervisors and take each reimage opportunity to modify the RAID configuration
- Schedule downtime for the remaining hypervisors that are already in Stretch/Mitaka so they're drained, reimaged and put back into production
Since draining hypervisors is a very disruptive process, the complete implementation of this proposal would have to take into account how much downtime we are comfortable with.
It is expected to be a year-long goal, if not more.
Voting
Please add more stakeholders as needed. Vote Yes/No and provide a justification.
Name | Vote | Comment |
---|---|---|
Andrew Bogott | - | - |
Arturo Borrero | Yes | We should increase robustness and resilience of CloudVPS. I know this involves capacity/budget/refresh planning. |
Brooke Storm | - | - |
Bryan Davis | Yes | Support for the piggyback strategy + investigating "converged infrastructure" idea of re-purposing cloudvirt local storage as Ceph storage that is exposed back to the cloudvirts for instance storage. |
Giovanni Tirloni | Yes | Reasons: Engineer time is more expensive than cost of spare disks. Lack of redundancy is unacceptable in face of data loss (which has already occurred). We cannot maintain any meaningful SLA with humans in the critical path. |
Decision
While some team members didn't cast a formal vote in this document, in a meeting held on Feb 12 the majority agreed that adopting hot spare disks was a good strategy.
We will:
- Reconfigure RAID arrays to add hot spares at every opportunity we have (reimages, moving hypervisors to the new eqiad1 region, etc.)
- Not drain existing hypervisors solely to reconfigure RAID, because draining is too time-consuming a process
- Configure new servers bought for codfw, which only serve dev/test purposes, with 1 hot spare disk instead of the 2 used in production hypervisors
- Investigate using the hypervisors as Ceph nodes themselves in a "converged infrastructure" approach
RAID configuration
The initial RAID configuration uses the last two disks as hot spares. Note, however, that as disks fail and are replaced, the hot spares will end up in different slots.
Dell servers
Reboot the server and reconfigure the controller through its UI.
Remember to go into 'Advanced' while creating the volume and select 'Add hot spares' and 'Initialize'.
When selecting the disks for the volume, leave the last 2 disks unselected. Unintuitively, a new window will pop up after the volume is created, asking you to select the spare disks.
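If rebooting into the UI is inconvenient, an equivalent configuration can likely be done from Linux with perccli (Dell's rebranded storcli). Treat the following as a sketch only: the controller number, enclosure ID and slot numbers (/c0, e32, slots 0-9) are assumptions and must be confirmed against the "perccli /c0 show" output on the actual host.

$ perccli /c0 show                                            # confirm enclosure IDs and slot numbers first
$ perccli /c0 add vd type=raid10 drives=32:0-7 pdperarray=2   # RAID-10 over all but the last two disks
$ perccli /c0/e32/s8 add hotsparedrive                        # the last two slots become hot spares
$ perccli /c0/e32/s9 add hotsparedrive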
HP servers (Gen8)
There is an issue with pressing F3 to delete the existing RAID volume. If possible, run the commands from Linux instead.
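The => prompt in the commands below is the interactive shell of HP's array CLI. A minimal sketch of getting there, assuming the hpssacli package is installed (newer HPE releases ship the same tool as ssacli, with identical syntax); commands can also be passed directly on the command line:

$ sudo hpssacli                             # opens the interactive shell with the => prompt
$ sudo hpssacli ctrl slot=0 pd all show     # non-interactive equivalent of the first command below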
Show current status:
=> ctrl slot=0 pd all show
Smart Array P420i in Slot 0 (Embedded)
   array A
      physicaldrive 1I:1:1 (port 1I:box 1:bay 1, SAS, 146 GB, OK)
      physicaldrive 1I:1:2 (port 1I:box 1:bay 2, SAS, 146 GB, OK)
   array B
      physicaldrive 1I:1:3 (port 1I:box 1:bay 3, SAS, 300 GB, OK)
      physicaldrive 1I:1:4 (port 1I:box 1:bay 4, SAS, 300 GB, OK)
      physicaldrive 1I:1:5 (port 1I:box 1:bay 5, SAS, 300 GB, OK)
      physicaldrive 1I:1:6 (port 1I:box 1:bay 6, SAS, 300 GB, OK)
      physicaldrive 1I:1:7 (port 1I:box 1:bay 7, SAS, 300 GB, OK)
      physicaldrive 1I:1:8 (port 1I:box 1:bay 8, SAS, 300 GB, OK)
      physicaldrive 1I:1:9 (port 1I:box 1:bay 9, SAS, 300 GB, OK)
      physicaldrive 1I:1:10 (port 1I:box 1:bay 10, SAS, 300 GB, OK)
      physicaldrive 1I:1:11 (port 1I:box 1:bay 11, SAS, 300 GB, OK)
      physicaldrive 1I:1:12 (port 1I:box 1:bay 12, SAS, 300 GB, OK)
      physicaldrive 1I:1:13 (port 1I:box 1:bay 13, SAS, 300 GB, OK)
      physicaldrive 2I:1:14 (port 2I:box 1:bay 14, SAS, 300 GB, OK)
      physicaldrive 2I:1:15 (port 2I:box 1:bay 15, SAS, 300 GB, OK)
      physicaldrive 2I:1:16 (port 2I:box 1:bay 16, SAS, 300 GB, OK)
      physicaldrive 2I:1:17 (port 2I:box 1:bay 17, SAS, 300 GB, OK)
      physicaldrive 2I:1:18 (port 2I:box 1:bay 18, SAS, 300 GB, OK)
Delete existing RAID volume:
=> ctrl slot=0 ld 2 delete forced
Warning: Deleting an array can cause other array letters to become renamed.
E.g. Deleting array A from arrays A,B,C will result in two remaining
arrays A,B ... not B,C
Create new RAID volume (leaving 2 disks for spares):
=> ctrl slot=0 create type=ld raid=1+0 drives=1I:1:3,1I:1:4,1I:1:5,1I:1:6,1I:1:7,1I:1:8,1I:1:9,1I:1:10,1I:1:11,1I:1:12,1I:1:13,2I:1:14,2I:1:15,2I:1:16
Add hot spares to the array:
=> ctrl slot=0 array B add spares=2I:1:17,2I:1:18
Show RAID volume:
=> ctrl slot=0 ld 2 show
Smart Array P420i in Slot 0 (Embedded)
   array B
      Logical Drive: 2
         Size: 1.9 TB
         Fault Tolerance: 1+0
         Heads: 255
         Sectors Per Track: 32
         Cylinders: 65535
         Strip Size: 256 KB
         Full Stripe Size: 1792 KB
         Status: OK
         Caching: Enabled
         Unique Identifier: 600508B1001C2FD858B7C664FB32BECD
         Disk Name: /dev/sdb
         Mount Points: None
         Logical Drive Label: 045EEA760014380311954D026CC
         Mirror Group 1:
            physicaldrive 1I:1:3 (port 1I:box 1:bay 3, SAS, 300 GB, OK)
            physicaldrive 1I:1:4 (port 1I:box 1:bay 4, SAS, 300 GB, OK)
            physicaldrive 1I:1:5 (port 1I:box 1:bay 5, SAS, 300 GB, OK)
            physicaldrive 1I:1:6 (port 1I:box 1:bay 6, SAS, 300 GB, OK)
            physicaldrive 1I:1:7 (port 1I:box 1:bay 7, SAS, 300 GB, OK)
            physicaldrive 1I:1:8 (port 1I:box 1:bay 8, SAS, 300 GB, OK)
            physicaldrive 1I:1:9 (port 1I:box 1:bay 9, SAS, 300 GB, OK)
         Mirror Group 2:
            physicaldrive 1I:1:10 (port 1I:box 1:bay 10, SAS, 300 GB, OK)
            physicaldrive 1I:1:11 (port 1I:box 1:bay 11, SAS, 300 GB, OK)
            physicaldrive 1I:1:12 (port 1I:box 1:bay 12, SAS, 300 GB, OK)
            physicaldrive 1I:1:13 (port 1I:box 1:bay 13, SAS, 300 GB, OK)
            physicaldrive 2I:1:14 (port 2I:box 1:bay 14, SAS, 300 GB, OK)
            physicaldrive 2I:1:15 (port 2I:box 1:bay 15, SAS, 300 GB, OK)
            physicaldrive 2I:1:16 (port 2I:box 1:bay 16, SAS, 300 GB, OK)
         Drive Type: Data
         LD Acceleration Method: Controller Cache
Show all disks (confirm there are 2 spares):
=> ctrl slot=0 pd all show
Smart Array P420i in Slot 0 (Embedded)
   array A
      physicaldrive 1I:1:1 (port 1I:box 1:bay 1, SAS, 146 GB, OK)
      physicaldrive 1I:1:2 (port 1I:box 1:bay 2, SAS, 146 GB, OK)
   array B
      physicaldrive 1I:1:3 (port 1I:box 1:bay 3, SAS, 300 GB, OK)
      physicaldrive 1I:1:4 (port 1I:box 1:bay 4, SAS, 300 GB, OK)
      physicaldrive 1I:1:5 (port 1I:box 1:bay 5, SAS, 300 GB, OK)
      physicaldrive 1I:1:6 (port 1I:box 1:bay 6, SAS, 300 GB, OK)
      physicaldrive 1I:1:7 (port 1I:box 1:bay 7, SAS, 300 GB, OK)
      physicaldrive 1I:1:8 (port 1I:box 1:bay 8, SAS, 300 GB, OK)
      physicaldrive 1I:1:9 (port 1I:box 1:bay 9, SAS, 300 GB, OK)
      physicaldrive 1I:1:10 (port 1I:box 1:bay 10, SAS, 300 GB, OK)
      physicaldrive 1I:1:11 (port 1I:box 1:bay 11, SAS, 300 GB, OK)
      physicaldrive 1I:1:12 (port 1I:box 1:bay 12, SAS, 300 GB, OK)
      physicaldrive 1I:1:13 (port 1I:box 1:bay 13, SAS, 300 GB, OK)
      physicaldrive 2I:1:14 (port 2I:box 1:bay 14, SAS, 300 GB, OK)
      physicaldrive 2I:1:15 (port 2I:box 1:bay 15, SAS, 300 GB, OK)
      physicaldrive 2I:1:16 (port 2I:box 1:bay 16, SAS, 300 GB, OK)
      physicaldrive 2I:1:17 (port 2I:box 1:bay 17, SAS, 300 GB, OK, spare)
      physicaldrive 2I:1:18 (port 2I:box 1:bay 18, SAS, 300 GB, OK, spare)
HP servers (Gen9)
Enter the HP RAID configuration:
- Reboot the server
- Press ESC+9 to enter the menu
- Select the RAID controller in slot 1
- Select the option to open the configuration utility
- Wait for the "error: no such device: HPEZCD260" message to disappear
Verify current situation:
=> controller slot=1 ld 1 show
Smart Array P840 in Slot 1
   Array A
      Logical Drive: 1
         Size: 7.3 TB
         Fault Tolerance: 1+0
         Heads: 255
         Sectors Per Track: 32
         Cylinders: 65535
         Strip Size: 256 KB
         Full Stripe Size: 1280 KB
         Status: OK
         MultiDomain Status: OK
         Caching: Disabled
         Unique Identifier: 600508B1001CB0D3B3EFD3A1715AB007
         Disk Name: /dev/sdd
         Mount Points: None
         Logical Drive Label: 0110F6E7PDNNF0ARH8015T82C5
         Mirror Group 1:
            physicaldrive 1I:1:5 (port 1I:box 1:bay 5, SATA SSD, 1.6 TB, OK)
            physicaldrive 1I:1:6 (port 1I:box 1:bay 6, SATA SSD, 1.6 TB, OK)
            physicaldrive 1I:1:7 (port 1I:box 1:bay 7, SATA SSD, 1.6 TB, OK)
            physicaldrive 1I:1:8 (port 1I:box 1:bay 8, SATA SSD, 1.6 TB, OK)
            physicaldrive 2I:1:1 (port 2I:box 1:bay 1, SATA SSD, 1.6 TB, OK)
         Mirror Group 2:
            physicaldrive 2I:1:2 (port 2I:box 1:bay 2, SATA SSD, 1.6 TB, OK)
            physicaldrive 2I:1:3 (port 2I:box 1:bay 3, SATA SSD, 1.6 TB, OK)
            physicaldrive 2I:1:4 (port 2I:box 1:bay 4, SATA SSD, 1.6 TB, OK)
            physicaldrive 2I:2:1 (port 2I:box 2:bay 1, SATA SSD, 1.6 TB, OK)
            physicaldrive 2I:2:2 (port 2I:box 2:bay 2, SATA SSD, 1.6 TB, OK)
         Drive Type: Data
         LD Acceleration Method: SSD Smart Path
Delete existing array:
=> ctrl slot=1 array A delete forced
Warning: Deleting the specified device(s) will result in data being lost.
Continue? (y/n) y
Confirm all disks are now unassigned:
=> ctrl slot=1 pd all show
Smart Array P840 in Slot 1
   Unassigned
      physicaldrive 1I:1:5 (port 1I:box 1:bay 5, SATA SSD, 1.6 TB, OK)
      physicaldrive 1I:1:6 (port 1I:box 1:bay 6, SATA SSD, 1.6 TB, OK)
      physicaldrive 1I:1:7 (port 1I:box 1:bay 7, SATA SSD, 1.6 TB, OK)
      physicaldrive 1I:1:8 (port 1I:box 1:bay 8, SATA SSD, 1.6 TB, OK)
      physicaldrive 2I:1:1 (port 2I:box 1:bay 1, SATA SSD, 1.6 TB, OK)
      physicaldrive 2I:1:2 (port 2I:box 1:bay 2, SATA SSD, 1.6 TB, OK)
      physicaldrive 2I:1:3 (port 2I:box 1:bay 3, SATA SSD, 1.6 TB, OK)
      physicaldrive 2I:1:4 (port 2I:box 1:bay 4, SATA SSD, 1.6 TB, OK)
      physicaldrive 2I:2:1 (port 2I:box 2:bay 1, SATA SSD, 1.6 TB, OK)
      physicaldrive 2I:2:2 (port 2I:box 2:bay 2, SATA SSD, 1.6 TB, OK)
Create a new array / logical drive (leaving two disks to become spares later):
=> ctrl slot=1 create type=ld drives=1I:1:5,1I:1:6,1I:1:7,1I:1:8,2I:1:1,2I:1:2,2I:1:3,2I:1:4 raid=1+0 forced
Warning: SSD Over Provisioning Optimization will be performed on the physical
drives in this array. This process may take a long time and cause this
application to appear unresponsive. Continue? (y/n)y
Add the last 2 disks as spares:
=> ctrl slot=1 array all add spares=2I:2:1,2I:2:2
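As a final check before reimaging, confirm that the controller now reports the last two disks as spares (the same check used in the Gen8 section). The output, trimmed here, should list the two box 2 drives with a ", spare" suffix:

=> ctrl slot=1 pd all show
   ...
      physicaldrive 2I:2:1 (port 2I:box 2:bay 1, SATA SSD, 1.6 TB, OK, spare)
      physicaldrive 2I:2:2 (port 2I:box 2:bay 2, SATA SSD, 1.6 TB, OK, spare)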
Resources
Add papers, external links, etc. that support this proposal.