Building Resilient Systems: Redundancy and Failover in DevOps

5 June 2024

By Leo de Jager

Chances are you’ve seen service providers promise 99.999% uptime in a given year. When you do the math, that leaves room for roughly 5.26 minutes of downtime per year. Even those few minutes can make a huge difference in industries like healthcare, finance, and telecommunications, and could cost you customers and sales if you’re running an eCommerce business.
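
The arithmetic behind that figure is easy to reproduce. The snippet below is a minimal sketch that assumes a 365-day year; the function name is our own and purely illustrative.

```python
MINUTES_PER_YEAR = 365 * 24 * 60  # assumes a non-leap year: 525,600 minutes


def downtime_minutes_per_year(availability_percent: float) -> float:
    """Return the yearly downtime (in minutes) allowed by an uptime target."""
    if not 0 <= availability_percent <= 100:
        raise ValueError("availability must be between 0 and 100")
    unavailable_fraction = 1 - availability_percent / 100
    return unavailable_fraction * MINUTES_PER_YEAR


# Compare a few common uptime targets side by side.
for target in (99.0, 99.9, 99.99, 99.999):
    minutes = downtime_minutes_per_year(target)
    print(f"{target}% uptime allows {minutes:.2f} minutes of downtime per year")
```

Each extra "nine" cuts the downtime budget by a factor of ten, which is why the jump from 99.9% to 99.999% usually calls for deliberate design rather than better luck.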

Redundancy and failover can dramatically reduce the likelihood of downtime so that you’re always on and always available. However, high availability can only be achieved with careful planning and well-thought-out system design.

Redundancy refers to having more than one of the same function or component so that, should the first one fail, the second can take its place without disrupting any services. Failover is the process of switching to a backup system or component, usually automatically, as soon as a failure is detected. The key difference is that a failover system doesn’t have to be a duplicate of the primary system, whereas redundant functions or components are usually an exact match of the main function or component.

Either of these can be used to ensure the reliability and availability of systems in a DevOps environment, but which one you choose will depend on your requirements and budget. Let’s take a closer look.

Redundancy

Earlier this year it was reported that the Google Cloud account of UniSuper was accidentally deleted due to a misconfiguration by Google. You’d think that an organisation as vast as Google would have contingencies in place. It did, in two distinct geographical locations, but those were deleted as well. Luckily, the UniSuper DevOps team also kept a backup with a different service provider, which could be used to restore the accounts of more than half a million UniSuper account holders.

Data redundancy

The above scenario is an example of data redundancy. While the UniSuper team had backups with Google and a different service provider, creating data redundancy can be as simple as implementing the 3-2-1 backup rule: keep three copies of your data (the original plus two backups), store them on at least two different types of media, and keep at least one copy at a different geographic location.
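
To make the rule concrete, here is a minimal sketch that checks whether a set of copies meets it. The `BackupCopy` type, the media labels, and the location names are all hypothetical and only serve the illustration.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class BackupCopy:
    name: str
    media: str     # hypothetical labels, e.g. "disk", "tape", "object-storage"
    location: str  # hypothetical site or region name


def satisfies_3_2_1(copies: list, primary_location: str) -> bool:
    """Check the 3-2-1 rule; `copies` includes the original data set."""
    enough_copies = len(copies) >= 3
    enough_media = len({copy.media for copy in copies}) >= 2
    has_offsite = any(copy.location != primary_location for copy in copies)
    return enough_copies and enough_media and has_offsite


plan = [
    BackupCopy("production database", "disk", "site-a"),
    BackupCopy("nightly snapshot", "disk", "site-a"),
    BackupCopy("weekly archive", "object-storage", "site-b"),
]
print(satisfies_3_2_1(plan, primary_location="site-a"))
```

A check like this could run as part of a backup audit, although it only validates the plan on paper. Restores still need to be tested for real.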

But data redundancy can also refer to replication, where copies of data are stored across multiple disks, databases, or storage systems. RAID (Redundant Array of Independent Disks) mirrors data or stores parity information across several disks, while distributed databases (e.g., Cassandra, MongoDB) replicate data across multiple nodes. Replication can be synchronous, where data is written to multiple locations in real time, or asynchronous, where data is replicated to secondary locations after a delay.
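
The difference between the two replication modes can be sketched with a toy in-memory store. This is an illustration under simplifying assumptions (one primary, one replica, no network failures) and not how any particular database implements replication; the class and method names are our own.

```python
from collections import deque


class ReplicatedStore:
    """Toy key-value store with one primary and one replica."""

    def __init__(self, synchronous: bool):
        self.synchronous = synchronous
        self.primary = {}
        self.replica = {}
        self._pending = deque()  # writes not yet applied to the replica

    def write(self, key: str, value: str) -> None:
        self.primary[key] = value
        if self.synchronous:
            # Synchronous: the replica is updated before the write returns.
            self.replica[key] = value
        else:
            # Asynchronous: the write returns now, the replica catches up later.
            self._pending.append((key, value))

    def replication_lag(self) -> int:
        """Number of writes the replica has not seen yet."""
        return len(self._pending)

    def replicate_pending(self) -> int:
        """Apply queued writes to the replica and return how many were applied."""
        applied = 0
        while self._pending:
            key, value = self._pending.popleft()
            self.replica[key] = value
            applied += 1
        return applied
```

The trade-off shows up in `replication_lag()`: a synchronous store never falls behind but every write waits for the replica, while an asynchronous store accepts writes faster and risks losing whatever is still queued if the primary fails.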

Hardware redundancy

There is no shortage of cases that illustrate the worth of hardware redundancy. One such case involves Amazon AWS. Back in 2011, AWS suffered a major outage that affected high-profile customers such as Reddit, Quora, and Foursquare. But AWS makes use of multiple Availability Zones (AZs) in each region. An AZ comprises one or more locations where hardware resources are housed within a given region. As such, customers whose workloads were spread across more than one AZ were better placed to maintain service availability and remain online.

While virtually any piece of hardware can be made redundant, in a DevOps context we often find the following hardware components with redundancies in place:

Servers: Multiple servers can be used to perform the same task. If one server fails, others can continue to operate.
Power Supplies: Redundant power supplies keep systems operational if one power source fails.
Network Components: Redundant switches, routers, and network paths prevent a single point of failure in the network infrastructure.
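
A rough way to see why duplicating hardware pays off is to estimate the availability of several identical components running side by side. The sketch below assumes that failures are independent and that any single surviving copy can carry the full load, which real deployments only approximate; the function name is illustrative.

```python
def combined_availability(component_availability: float, copies: int) -> float:
    """Availability of `copies` redundant components, as a fraction of 1.

    Assumes independent failures: the group is down only when
    every copy is down at the same time.
    """
    if not 0 <= component_availability <= 1:
        raise ValueError("availability must be a fraction between 0 and 1")
    if copies < 1:
        raise ValueError("at least one copy is required")
    return 1 - (1 - component_availability) ** copies


for copies in (1, 2, 3):
    availability = combined_availability(0.99, copies)
    print(f"{copies} server(s) at 99% each: {availability:.4%} combined")
```

Shared dependencies such as a common power feed or network switch break the independence assumption, which is why power supplies and network components appear on the list alongside servers.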

Software redundancy

Ever wondered how big brands like Facebook manage to stay online and available 24/7? While the entire strategy is likely top secret and locked away behind some impressive security, software redundancy forms a key part of high availability.

Software redundancy can encompass many different techniques. Two of the most common and most effective are microservices and load balancers.

Microservices refers to the use of a microservices architecture during application design. The idea here is that individual services within the application operate independently. If one service fails, others remain unaffected.

Load Balancers can be used to distribute traffic across multiple instances of an application to ensure that no single instance becomes a point of failure.
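
As an illustration of the load-balancing idea, here is a minimal round-robin sketch that skips instances marked as unhealthy. Production load balancers do far more (health probes, connection draining, weighting); the class and the instance names are hypothetical.

```python
class RoundRobinBalancer:
    """Hand out instances in turn, skipping any that are marked down."""

    def __init__(self, instances):
        self._instances = list(instances)
        if not self._instances:
            raise ValueError("at least one instance is required")
        self._healthy = set(self._instances)
        self._next = 0  # index of the next instance to try

    def mark_down(self, instance: str) -> None:
        self._healthy.discard(instance)

    def mark_up(self, instance: str) -> None:
        if instance in self._instances:
            self._healthy.add(instance)

    def pick(self) -> str:
        # Try each instance at most once per call, starting where we left off.
        for _ in range(len(self._instances)):
            candidate = self._instances[self._next]
            self._next = (self._next + 1) % len(self._instances)
            if candidate in self._healthy:
                return candidate
        raise RuntimeError("no healthy instances available")


balancer = RoundRobinBalancer(["app-1", "app-2", "app-3"])
balancer.mark_down("app-2")
print([balancer.pick() for _ in range(4)])
```

Because traffic simply flows around the failed instance, the remaining instances must have enough spare capacity to absorb its share.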

Failover

Failover mechanisms automatically switch to a standby system or component when the primary one fails.

There are three types of failover mechanisms:

Cold Failover: With a cold failover, standby systems are only activated when the primary system fails. There might be a slight delay as the standby system starts up.
Warm Failover: Here standby systems run at a reduced capacity, but are already active. They can take over faster than cold failover systems when a failure occurs.
Hot Failover: Standby systems run simultaneously with primary systems and can take over instantly without any noticeable downtime.
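
Whichever type you choose, the automatic switch usually hinges on failure detection. The sketch below shows one common pattern: promote the standby after a number of consecutive failed health checks. The threshold of three and all names are assumptions made for the illustration, and the time the promoted standby needs to become ready (longest for cold, shortest for hot) is deliberately left out.

```python
from typing import Optional


class FailoverMonitor:
    """Promote the standby after several consecutive failed health checks."""

    def __init__(self, primary: str, standby: Optional[str], failure_threshold: int = 3):
        if failure_threshold < 1:
            raise ValueError("failure_threshold must be at least 1")
        self.active = primary
        self.standby = standby
        self.failure_threshold = failure_threshold
        self._consecutive_failures = 0

    def record_health_check(self, healthy: bool) -> str:
        """Record one health check result and return the currently active system."""
        if healthy:
            self._consecutive_failures = 0
            return self.active
        self._consecutive_failures += 1
        threshold_reached = self._consecutive_failures >= self.failure_threshold
        if threshold_reached and self.standby is not None:
            # Fail over: the standby becomes active and no standby remains.
            self.active, self.standby = self.standby, None
            self._consecutive_failures = 0
        return self.active


monitor = FailoverMonitor(primary="db-primary", standby="db-standby")
for result in (True, False, False, False):
    active = monitor.record_health_check(result)
print(active)
```

Requiring several consecutive failures avoids switching over on a single dropped health check, at the cost of a slightly slower reaction to a genuine outage.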

Unlike redundant components, failover systems don’t have to be identical to the primary system. In fact, as long as a few key requirements are met, the failover system can be vastly different. Those requirements include:

Capacity to handle load. The failover system must be able to handle the same workload or amount of traffic that the primary system can handle. Storage, processing power, memory, and other computational resources should be sufficient.
Compatibility and functionality. The failover system must be compatible with the primary system’s data and processes, even when the underlying architecture and/or technology differs.
Data synchronisation. Data consistency and synchronisation between the primary and failover systems are crucial. The failover system should have access to the latest data or be able to quickly synchronise data to ensure a seamless transition.
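
These three requirements lend themselves to an automated readiness check. The following is a minimal sketch; the `SystemProfile` fields, the data format labels, and the lag limit are hypothetical stand-ins for whatever metrics your own systems expose.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class SystemProfile:
    name: str
    max_requests_per_second: int      # hypothetical capacity metric
    supported_data_formats: frozenset  # hypothetical compatibility marker
    replication_lag_seconds: float = 0.0


def readiness_problems(primary: SystemProfile, standby: SystemProfile,
                       max_lag_seconds: float) -> list:
    """Return the requirements the standby currently fails (empty if ready)."""
    problems = []
    if standby.max_requests_per_second < primary.max_requests_per_second:
        problems.append("capacity")
    # The standby must understand every data format the primary uses.
    if not primary.supported_data_formats <= standby.supported_data_formats:
        problems.append("compatibility")
    if standby.replication_lag_seconds > max_lag_seconds:
        problems.append("data synchronisation")
    return problems


on_prem = SystemProfile("on-prem", 500, frozenset({"orders-v2"}))
cloud = SystemProfile("cloud", 800, frozenset({"orders-v1", "orders-v2"}), 4.0)
print(readiness_problems(on_prem, cloud, max_lag_seconds=10.0))
```

Running a check like this on a schedule, rather than only during an incident, helps surface a standby that has quietly fallen behind.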

A common example of failover is where on-premises systems fail over to the cloud. The cloud systems don’t need to resemble the on-premises infrastructure but must be capable of handling the same applications and data load.

In conclusion, achieving high availability and system resilience in a DevOps environment requires a strategic combination of redundancy and failover mechanisms. Redundancy ensures that critical components have backups ready to take over instantly, while failover mechanisms provide seamless transitions to standby systems when primary systems fail. By carefully planning and implementing these strategies, organisations can minimise downtime, maintain service availability, and enhance their overall operational reliability. Whether through data redundancy, hardware redundancy, or software redundancy, each approach plays a vital role in building a robust and resilient system architecture.