Wikimedia Cloud Services team/EnhancementProposals/Systems and Service Continuity

Introduction

The Systems and Service Continuity plan helps our team prepare for the worst-case scenario; it not only includes details on recovering from system unavailability but also on preventing it from happening in the first place. This plan investigates, develops, and implements recovery options for when an interruption to systems and service occurs.

The goal is to identify risks which could pose threats to the continuity of our systems and service. These typically include, but are not limited to:

  • Application services failures
  • Hardware failures
  • Deliberate infiltration
  • Attacks on critical information systems

Note that this plan does not cover IT Disaster Recovery which usually includes:

  • All equipment and data within the datacenter destroyed due to natural disaster
  • Access to the datacenter prohibited due to physical damage or legal reasons (e.g. court proceedings)

Services

The Cloud Services team offers many services, which can be broken down into the following categories:

  • CloudVPS (Infrastructure as a service or IaaS)
  • Toolforge (Platform as a service or PaaS)
  • Wiki-replicas, toolsdb (Data as a service or DaaS)
  • Others

IaaS (hardware)

  • Network (NIC, switches)
  • Storage (RAID)
  • Backup and Recovery

TODO: describe the current status from the availability and continuity point of view, and identify SPOFs. (done)

TODO: identify improvements for both the short term and the long term. Do we need them? Are they cost-effective?

Network

Network Components

  Component                      SPOF  Comments
  Network Interface Cards (NIC)  Yes
  Network Switches               Yes   Both eth0 and eth1 connect to the same switch
  Network Cable                  Yes

The limitation of the current network setup is that each server connects to only one network switch, and no redundant NIC is installed on the servers. The NIC, the cable, and the switch to which it connects are all single points of failure. In the event of a failure, human intervention is needed to replace the failed NIC, cable, or network switch. This process may take a long time, as we do not have an onsite technician available 24/7 at the datacenter.

To achieve higher availability, we can introduce NIC teaming with a secondary switch.

NIC Teaming

The NIC teaming design eliminates these single points of failure by providing special hardware drivers that allow two NICs to be connected to two different access network switches. If one NIC fails, the secondary NIC automatically assumes the IP address of the first and takes over operation without disruption. NIC teaming solutions include active/standby and active/active; all solutions require the NICs to have Layer 2 adjacency with each other.

[Figure: NIC teaming layout]

Two possible NIC teaming solutions:

  1. Switch fault tolerance (SFT, active/standby): one port is active and the other standby using one common IP address and MAC address.
  2. Adaptive load balancing (ALB, active/active): one port receives and all ports transmit using one IP address and multiple MAC addresses.
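On Linux, both modes are provided by the kernel bonding driver, which exposes its state under /proc/net/bonding/. Below is a minimal sketch that reports the configured mode and the currently active slave; the bond0 interface name is an assumption, so substitute the real bonded interface.

```python
#!/usr/bin/env python3
"""Minimal sketch: report Linux NIC teaming (bonding) status.

Assumes the Linux bonding driver; "bond0" is a hypothetical
interface name -- replace it with the real bonded interface.
"""
from pathlib import Path

BOND = "bond0"  # hypothetical interface name

def bond_status(bond: str = BOND) -> dict:
    """Parse /proc/net/bonding/<bond> for mode and active slave."""
    status = {}
    for line in Path(f"/proc/net/bonding/{bond}").read_text().splitlines():
        if line.startswith("Bonding Mode:"):
            # e.g. "fault-tolerance (active-backup)" corresponds to SFT,
            # "adaptive load balancing" corresponds to ALB
            status["mode"] = line.split(":", 1)[1].strip()
        elif line.startswith("Currently Active Slave:"):
            status["active_slave"] = line.split(":", 1)[1].strip()
    return status

if __name__ == "__main__":
    print(bond_status())
```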

Action Plans

Introducing NIC teaming will take some planning and will not be completed in the short term (i.e. within a quarter). The next step is to perform a cost-benefit analysis:

  • Find out whether the current NICs have teaming capability (see the sketch after this list); if not, how much an upgrade would cost
  • Find out whether it is possible to install another network switch, and at what cost
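As a starting point for the first bullet, a hedged sketch that gathers NIC driver and firmware details with ethtool. On Linux, teaming itself is a kernel (bonding) feature, so the answer mainly informs the hardware and switch side of the cost-benefit analysis; the interface names are examples.

```python
#!/usr/bin/env python3
"""Sketch: collect NIC driver/firmware details as input for the CBA.

Assumes ethtool is installed; the interface names are examples.
"""
import subprocess

INTERFACES = ["eth0", "eth1"]  # example names; substitute the real ones

for iface in INTERFACES:
    result = subprocess.run(["ethtool", "-i", iface],
                            capture_output=True, text=True, check=False)
    print(f"--- {iface} ---")
    print(result.stdout or result.stderr)
```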

IaaS (software)

TODO: describe the current status from the availability and continuity point of view, and identify SPOFs.

TODO: identify improvements (ideal model) for both the short term and the long term. Do we need them? Are they cost-effective?

OpenStack

OpenStack is used to manage the cloud computing platform.

OpenStack Configuration (EQIAD)

  Service          High Availability  Automatic Failover (or all active)  Nodes
  Designate        Yes                Yes                                 cloudservices1003, cloudservices1004
  Glance API       Yes                No                                  cloudcontrol1003, cloudcontrol1004
  Glance Registry  Yes                No                                  cloudcontrol1003, cloudcontrol1004
  Keystone         Yes                Yes                                 cloudcontrol1003, cloudcontrol1004
  Neutron          Yes                Yes                                 cloudcontrol1003, cloudcontrol1004
  Nova             Yes                Yes                                 cloudcontrol1003, cloudcontrol1004
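Since most of these services run active on both cloudcontrol nodes, a periodic check that every backend reports "up" is a cheap continuity safeguard. A sketch using the standard OpenStack CLI follows (compute services only; assumes the `openstack` client is installed and admin credentials are loaded in the environment):

```python
#!/usr/bin/env python3
"""Sketch: flag OpenStack compute services that are not up.

Assumes the `openstack` CLI and admin credentials in the environment;
host names follow the table above.
"""
import json
import subprocess

EXPECTED_HOSTS = {"cloudcontrol1003", "cloudcontrol1004"}

out = subprocess.run(
    ["openstack", "compute", "service", "list", "-f", "json"],
    capture_output=True, text=True, check=True)

for svc in json.loads(out.stdout):
    host, state = svc.get("Host", ""), svc.get("State", "")
    if any(host.startswith(h) for h in EXPECTED_HOSTS) and state != "up":
        print(f"DEGRADED: {svc.get('Binary')} on {host} is {state}")
```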

Provisioning/bootstrap

TODO: what does this mean? Current status or improvement?

Configuration Management

Puppet is used as the configuration management tool.

TODO: what does this mean? Current status or improvement?

PaaS (hardware)

Our PaaS (Toolforge) runs on CloudVPS instances rather than on dedicated hardware.

TODO: describe the current status from the availability and continuity point of view, and identify SPOFs.

TODO: identify improvements (ideal model) for both the short term and the long term. Do we need them? Are they cost-effective?

PaaS (software)

Legacy Container Orchestration Platform: Kubernetes

The legacy Kubernetes cluster is not HA because there is only a single master node.

Legacy Kubernetes Configuration

  Service              High Availability  Automatic Failover (or all active)  Nodes
  API Server (Master)  No                 N/A                                 tools-k8s-master-01
  Worker Server        Yes                Yes                                 tools-worker-10[01-40]
  Etcd Server          Yes                Yes                                 tools-k8s-etcd-01, tools-k8s-etcd-02, tools-k8s-etcd-03
  Flannel              Yes                Yes                                 tools-flannel-etcd-01, tools-flannel-etcd-02, tools-flannel-etcd-03

New Container Orchestration Platform: Kubernetes

The new Kubernetes cluster will be able to achieve HA, since each Kubernetes service runs more than one instance.

New Kubernetes Configuration

  Service              High Availability  Automatic Failover (or all active)
  API Server (Master)  Yes                Yes
  Worker               Yes                Yes
  Etcd Server          Yes                Yes
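One way to verify the HA claim once the new cluster is built is to count its control-plane nodes. A minimal sketch with the official Kubernetes Python client follows (assumes a kubeconfig with access to the cluster; the master label matches Kubernetes releases of this era):

```python
#!/usr/bin/env python3
"""Sketch: confirm there is more than one control-plane node.

Assumes the `kubernetes` Python client and a valid kubeconfig.
"""
from kubernetes import client, config

config.load_kube_config()
nodes = client.CoreV1Api().list_node(
    label_selector="node-role.kubernetes.io/master")
masters = [n.metadata.name for n in nodes.items]
print(f"{len(masters)} control-plane node(s): {masters}")
if len(masters) < 2:
    print("WARNING: a single master leaves the API server as a SPOF")
```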

Son of Grid Engine

TODO

Auxiliary Services

Even though each of the auxiliary services has a standby instance, they are not considered HA, since the failover process must be performed manually.

Auxiliary Services Configuration

  Service          High Availability  Automatic Failover (or all active)  Nodes                                               Comments
  Dynamic Proxy    No                 No                                  tools-proxy-05, tools-proxy-06
  Docker Registry  No                 No                                  tools-docker-registry-03, tools-docker-registry-04
  Redis            No                 No                                  tools-redis-1001, tools-redis-1002                  Used by Dynamic Proxy

Legacy Distributed File System: NFS

NFS is currently used as the distributed file system service. It is not HA.

NFS has one active node and one standby node. The process of switching from the active node to the standby is protracted and laborious: past incidents show that it can take multiple people working in parallel, and can easily take over an hour, because multiple clients need to be restarted and checked in order to re-attach the NFS mount.
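Much of that hour goes into finding which clients are stuck. A hedged sketch of a per-client check: a stat on a hard NFS mount blocks while the server is away, so running it under a timeout identifies hung mounts quickly (the mount point and timeout are example values):

```python
#!/usr/bin/env python3
"""Sketch: detect hung NFS mounts on a client after a failover.

The mount point and timeout are example values.
"""
import subprocess

MOUNTS = ["/mnt/nfs/tools"]  # hypothetical mount point
TIMEOUT_S = 10

for mnt in MOUNTS:
    try:
        subprocess.run(["stat", "-t", mnt], timeout=TIMEOUT_S,
                       capture_output=True, check=True)
        print(f"OK:   {mnt}")
    except subprocess.TimeoutExpired:
        print(f"HUNG: {mnt} (client likely needs a remount or restart)")
    except subprocess.CalledProcessError as err:
        print(f"ERR:  {mnt}: {err}")
```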

New Distributed File System: Ceph

Ceph will be used to replace NFS as the new distributed file system [1]. The design is HA once up and running. Among the many unknowns we have with Ceph are its speed and reliability; we plan to run multiple POCs and performance tests to address these concerns.
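For the performance side of those POCs, even a crude write/fsync latency probe gives numbers that are comparable across NFS and Ceph mounts. A minimal sketch follows; the test path, block size, and round count are arbitrary placeholders, not a benchmark methodology:

```python
#!/usr/bin/env python3
"""Sketch: tiny write+fsync latency probe for filesystem POCs.

PATH, BLOCK, and ROUNDS are placeholder values.
"""
import os
import time

PATH = "/mnt/under-test/probe.bin"  # hypothetical test location
BLOCK = b"x" * 4096                 # one 4 KiB block per write
ROUNDS = 100

latencies = []
fd = os.open(PATH, os.O_WRONLY | os.O_CREAT | os.O_TRUNC)
try:
    for _ in range(ROUNDS):
        t0 = time.monotonic()
        os.write(fd, BLOCK)
        os.fsync(fd)  # force the block to stable storage
        latencies.append(time.monotonic() - t0)
finally:
    os.close(fd)
    os.unlink(PATH)

latencies.sort()
print(f"median {latencies[len(latencies) // 2] * 1000:.2f} ms, "
      f"p99 {latencies[int(len(latencies) * 0.99)] * 1000:.2f} ms "
      f"over {ROUNDS} writes")
```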

DaaS (Hardware)

Wiki Replicas

TODO: describe the current status from the availability and continuity point of view, and identify SPOFs.

TODO: identify improvements (ideal model) for both the short term and the long term. Do we need them? Are they cost-effective?

Dumps

TODO: describe the current status from the availability and continuity point of view, and identify SPOFs.

TODO: identify improvements (ideal model) for both the short term and the long term. Do we need them? Are they cost-effective?

NFS

TODO: describe the current status from the availability and continuity point of view, and identify SPOFs.

TODO: identify improvements (ideal model) for both the short term and the long term. Do we need them? Are they cost-effective?

Rsync

TODO: describe the current status from the availability and continuity point of view, and identify SPOFs.

TODO: identify improvements (ideal model) for both the short term and the long term. Do we need them? Are they cost-effective?

Web

TODO: describe the current status from the availability and continuity point of view, and identify SPOFs.

TODO: identify improvements (ideal model) for both the short term and the long term. Do we need them? Are they cost-effective?

NFS

TODO: describe the current status from the availability and continuity point of view, and identify SPOFs.

TODO: identify improvements (ideal model) for both the short term and the long term. Do we need them? Are they cost-effective?

DaaS (software)

Toolsdb

TODO: describe the current status from the availability and continuity point of view, and identify SPOFs.

TODO: identify improvements (ideal model) for both the short term and the long term. Do we need them? Are they cost-effective?

Wiki Replicas

TODO: describe the current status from the availability and continuity point of view, and identify SPOFs.

TODO: identify improvements (ideal model) for both the short term and the long term. Do we need them? Are they cost-effective?

Terminology

Power Supply

Power is one of the most critical components of server stability design. When a data center provider offers power redundancy in their facilities, they are referring to the amount of backup power available. If utility failures occur due to severe weather, equipment failure, or power line damage, data centers with more redundant power will be better equipped to avoid costly periods of downtime. For our plan, we need to identify the current power design of the data center where our servers are hosted to see if any improvement can be made or requested.

What is N? N+1? 2N?

The symbol N equals the amount of capacity required to power or cool the data center facility at full IT load. A design of N means the facility was designed only to account for the facility at full load and zero redundancy has been added. If the facility is at full load and there is a component failure or required maintenance, mission critical applications would suffer. N is the same as non-redundant.

If N equals the amount needed to run the data center facility, N+1 provides minimal reliability by adding one component to cover the failure or required maintenance of a single component. For example, if four UPS units are needed to carry the full load, an N+1 design deploys five.

2N power redundancy and distribution means that the data center has two independent power sources. If one power source has an interruption or loss of power, the other should still supply power thereby eliminating any potential downtime from the loss of the first power source. The major advantage of 2N power is that it is two completely separate inputs, circuits and systems, each with its own complete complement of equipment. In an N+1 configuration there is often one set of cabling that connects to the main power and backup, therefore, one catastrophic failure could wipe out all power options.

Redundancy Benefits Comparison

  Configuration  Reliability (normal)  Reliability (maintenance)  Capital Cost  Operating Cost  Complexity  Footprint
  N              4                     2                          10            8               10          10
  N+1            6                     4                          8             9               9           9
  N+2            6                     5                          6             8               9           8
  3N/2           8                     5                          5             6               2           5
  2N             9                     6                          3             4               5           4
  2(N+1)         10                    9                          2             2               4           2
Industry Trend

According to the Uptime Institute's 2018 operator survey [2], N+1 architecture designs are becoming more popular, with 51% of operator respondents having N+1 cooling equipment and 41% having N+1 power equipment configurations. The reason for this trend is that a 2N design is expensive to build and maintain, since it requires 100% replication of components and capacity. An N+1 design is cheaper to build and maintain, and more energy efficient, because the physical infrastructure is lighter and uses less energy.

Network Redundancy

Network redundancy is a practice through which alternative or additional network devices and equipment are installed within the network infrastructure. The design ensures network availability in the event of a network device, path, or connectivity failure. The ultimate goal is to provide a means of network failover for higher availability.

The following designs and features can increase the network availability of an infrastructure:

NIC Teaming

Physical servers can have many single points of failure: the Network Interface Card (NIC), the cable, and the switch to which it connects are all single points of failure. One possible solution to overcome this limitation is to introduce NIC teaming. This design eliminates these single points of failure by providing special drivers that allow two NICs to be connected to two different access switches, or to different line cards on the same access switch. If one NIC fails, the secondary NIC assumes the IP address of the server and takes over operation without disruption. NIC teaming solutions include active/standby and active/active; all require the NICs to have Layer 2 adjacency with each other. NIC teaming solutions are common in the data center multi-tier model design.

Server Clustering

The goal of server clustering is to combine multiple servers so that they appear as a single unified system through special software and network interconnects.

Clustering is a general term that is used to describe a particular type of grouped server arrangement that falls into the following categories:

  • High availability clusters: this type of cluster uses two or more servers and provides redundancy in the case of a server failure. If one node fails, another node in the cluster takes over with minor disruption.
  • Network load balanced clusters (NLB): this type of cluster consists of servers that work together to load-balance HTTP sessions on a website.
  • Database clusters: as databases become larger, searching the database becomes more complex and time-sensitive. Database clusters provide a way to enable efficient parallel scans and improve database lock times.

Software

OpenStack

OpenStack is a free and open-source software platform for cloud computing, mostly deployed as infrastructure-as-a-service (IaaS), whereby virtual servers and other resources are made available to customers. The software platform consists of interrelated components that control diverse, multi-vendor hardware pools of processing, storage, and networking resources throughout a data center. Users manage it either through a web-based dashboard, through command-line tools, or through RESTful web services. [3] Backend services that are fully active/active will automatically be pooled and depooled by HAProxy. As long as there is one instance available, there will be no interruption in service.
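The pool/depool behaviour reduces to a health check per backend: a backend that stops answering is dropped from rotation, and service continues while at least one backend answers. A toy illustration follows; the hosts and health endpoint are hypothetical, and in production this logic lives in HAProxy's configuration rather than in a script:

```python
#!/usr/bin/env python3
"""Toy illustration of health-check-driven pooling/depooling.

Backend URLs and the health path are hypothetical examples.
"""
import urllib.request

BACKENDS = ["http://cloudcontrol1003:5000", "http://cloudcontrol1004:5000"]
HEALTH_PATH = "/healthcheck"  # hypothetical endpoint

def healthy(base: str) -> bool:
    """A backend stays pooled only while its health check returns 200."""
    try:
        with urllib.request.urlopen(base + HEALTH_PATH, timeout=2) as resp:
            return resp.status == 200
    except OSError:
        return False

pooled = [b for b in BACKENDS if healthy(b)]
print("pooled backends:", pooled or "NONE -- service interruption")
```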

Kubernetes

Kubernetes is an open-source container-orchestration system for automating application deployment, scaling, and management. It aims to provide a "platform for automating deployment, scaling, and operations of application containers across clusters of hosts". [4]

Network File System

Network File System (NFS) is a distributed file system protocol originally developed by Sun Microsystems (Sun) in 1984, allowing a user on a client computer to access files over a computer network much like local storage is accessed. NFS, like many other protocols, builds on the Open Network Computing Remote Procedure Call (ONC RPC) system. [5]

Ceph

Ceph is a free-software storage platform that implements object storage on a single distributed computer cluster and provides interfaces for object-, block-, and file-level storage. Ceph aims primarily for completely distributed operation without a single point of failure, scalable to the exabyte level, and freely available. [6]

Cost–benefit analysis (CBA)

Cost-benefit analysis, sometimes called benefit-cost analysis (BCA), is a systematic approach to estimating the strengths and weaknesses of alternatives, used to determine the options which provide the best approach to achieving benefits while preserving savings (for example, in transactions, activities, and functional business requirements). [7]

References

  1. https://phabricator.wikimedia.org/T225320
  2. https://datacenter.com/wp-content/uploads/2018/11/2018-data-center-industry-survey.pdf
  3. https://en.wikipedia.org/wiki/OpenStack
  4. https://en.wikipedia.org/wiki/Kubernetes
  5. https://en.wikipedia.org/wiki/Network_File_System
  6. https://en.wikipedia.org/wiki/Ceph_(software)
  7. https://en.wikipedia.org/wiki/Cost%E2%80%93benefit_analysis