High availability means that an IT system, component, or application can operate at a high level, continuously, without intervention, for a given time period. High-availability infrastructure is configured to deliver quality performance and handle different loads and failures with minimal or zero downtime.
High-availability clusters are servers grouped together to operate as a single, unified system. Also known as failover clusters, they share the same storage but use different networks. They also share the same mission: each node can run the same workloads as the primary system it supports.
If a server in the cluster fails, another server or node can take over immediately to help ensure the application or service supported by the cluster remains operational. Using high-availability clusters helps ensure there is no single point of failure for critical IT and reduces or eliminates downtime.
High-availability clusters are tested regularly to confirm nodes are always at the ready. IT administrators will often use an open-source heartbeat program to monitor the health of the cluster. The program sends data packets to each machine in a cluster to confirm that it is functioning as intended.
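The heartbeat idea described above can be sketched in a few lines of Python. This is only an illustration, not the behavior of any particular heartbeat program: the class name `HeartbeatMonitor`, the node names, and the 3-second timeout are all assumptions made for the example, and timestamps are passed in by hand rather than read from real network probes.

```python
from dataclasses import dataclass, field


@dataclass
class HeartbeatMonitor:
    """Tracks the last time each node answered a heartbeat probe."""

    timeout_s: float  # hypothetical threshold; not specified in the article
    last_seen: dict[str, float] = field(default_factory=dict)

    def record(self, node: str, now: float) -> None:
        """Store the time at which `node` last responded."""
        self.last_seen[node] = now

    def unhealthy(self, now: float) -> list[str]:
        """Return nodes whose last response is older than the timeout."""
        return sorted(
            node
            for node, seen in self.last_seen.items()
            if now - seen > self.timeout_s
        )


monitor = HeartbeatMonitor(timeout_s=3.0)
monitor.record("node-a", now=0.0)
monitor.record("node-b", now=0.0)
monitor.record("node-a", now=4.0)  # node-a keeps answering; node-b goes silent
failed = monitor.unhealthy(now=5.0)
print(failed)
```

In a real cluster the list of unhealthy nodes would typically feed a failover decision rather than a print statement.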
High-availability software is used to operate high-availability clusters. In a high-availability IT system, there are different layers (physical, data link, network, transport, session, presentation, and application) that have different software needs.
At the application layer, for example, load-balancing software—which is used to distribute network traffic and application workloads across servers—is considered critical to help ensure high availability of an application.
High-availability software solutions typically provide load balancing and redirecting, automatic application failover, real-time file replication, and automatic failback capabilities.
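To make the load-balancing and redirecting capability more concrete, here is a minimal round-robin sketch in Python. It assumes a simple in-memory list of servers and a manual health flag; the class `RoundRobinBalancer` and the server names are hypothetical, and production load balancers use more sophisticated health checks and algorithms.

```python
from itertools import cycle


class RoundRobinBalancer:
    """Hands out servers in turn, skipping any marked as down."""

    def __init__(self, servers):
        self.servers = list(servers)
        self.healthy = set(self.servers)
        self._ring = cycle(self.servers)

    def mark_down(self, server):
        """Stop sending traffic to a server that failed a health check."""
        self.healthy.discard(server)

    def mark_up(self, server):
        """Resume sending traffic to a known server that has recovered."""
        if server in self.servers:
            self.healthy.add(server)

    def next_server(self):
        """Return the next healthy server in rotation."""
        if not self.healthy:
            raise RuntimeError("no healthy servers available")
        # At most one full pass over the ring is needed to find a healthy node.
        for _ in range(len(self.servers)):
            candidate = next(self._ring)
            if candidate in self.healthy:
                return candidate
        raise RuntimeError("no healthy servers available")


lb = RoundRobinBalancer(["app-1", "app-2", "app-3"])
before = [lb.next_server() for _ in range(3)]
lb.mark_down("app-2")  # simulate a failed server
after = [lb.next_server() for _ in range(4)]
print(before, after)
```

Once `app-2` is marked down, requests are redirected to the remaining servers without the caller needing to know which node failed.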
High-availability IT systems and services are designed to be available 99.999% of the time, accounting for both planned and unplanned outages. At this level, known as five nines reliability, the system is essentially always on.
If critical IT infrastructure fails but is supported by a high-availability architecture, the backup system or component takes over. This allows users and applications to keep working without disruption and to access the same data that was available before the failure occurred.
IT disaster recovery refers to the policies, tools, and procedures IT organizations must adopt to bring critical IT components and services back online following a catastrophe. An example of an IT disaster is the destruction of a data center due to a natural event like a major earthquake.
Think of high availability as a strategy for managing small but critical failures in IT infrastructure components that can be easily restored. IT disaster recovery is a process for overcoming major events that can sideline entire IT infrastructures.
Both high availability and disaster recovery are important for enhancing business continuity. So, too, is fault tolerance, as described later in this article. Planning for high availability includes identifying the IT systems and services deemed essential to help ensure business continuity.
High-availability IT infrastructure features hardware redundancy, software and application redundancy, and data redundancy. Redundancy means the IT components in a high-availability cluster, like servers or databases, can perform the same tasks.
Redundancy is also essential for fault tolerance, which complements high availability and IT disaster recovery, as discussed later in this article.
Replication of data is essential to achieving high availability. Data must be replicated to, and shared among, the nodes in a cluster. The nodes must communicate with each other and hold the same information, so that any one of them can step in to provide optimal service when the server or network device it is supporting fails.
Data can also be replicated between clusters to help ensure both high availability and business continuity in the event a data center fails.
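One simple way to picture replication within a cluster is a store that applies every write to all nodes before acknowledging it. The sketch below assumes synchronous, in-memory replication; the `Node` and `ReplicatedStore` classes and the sample key are hypothetical and stand in for whatever replication mechanism a real product uses.

```python
class Node:
    """A cluster member holding its own copy of the data."""

    def __init__(self, name):
        self.name = name
        self.data = {}


class ReplicatedStore:
    """Applies each write to every node so all copies stay identical."""

    def __init__(self, nodes):
        self.nodes = list(nodes)

    def write(self, key, value):
        # Synchronous replication: update every node before returning.
        for node in self.nodes:
            node.data[key] = value

    def read(self, key, from_node):
        """Read from a specific node, for example the one that took over."""
        return from_node.data[key]


primary, standby = Node("primary"), Node("standby")
store = ReplicatedStore([primary, standby])
store.write("order-42", "paid")

# If the primary fails, the standby already holds the same information.
value_on_standby = store.read("order-42", from_node=standby)
print(value_on_standby)
```

Asynchronous replication, which is common between data centers, trades this strict consistency for lower write latency.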
A failover occurs when a process performed by the failed primary component moves to a backup component in a high-availability cluster. A best practice for high availability—and disaster recovery—is to maintain a failover system that is located off-premises.
IT administrators monitoring the health of critical primary systems can quickly switch traffic to the failover system when primary systems become overloaded or fail.
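The traffic switch described above can be sketched as a small routing object that points at the primary while it is healthy and at the backup otherwise. The class `FailoverPair` and the site names are illustrative assumptions; real failover usually involves DNS, virtual IPs, or cluster managers rather than a single flag.

```python
class FailoverPair:
    """Routes to the primary when healthy, otherwise to the backup."""

    def __init__(self, primary, backup):
        self.primary = primary
        self.backup = backup
        self.primary_healthy = True

    def report(self, healthy):
        """Record the latest health-check result for the primary."""
        self.primary_healthy = healthy

    def active(self):
        """Return the system that should currently receive traffic."""
        return self.primary if self.primary_healthy else self.backup


pair = FailoverPair(primary="dc-onprem", backup="dc-offsite")
before = pair.active()
pair.report(healthy=False)  # primary fails or is overloaded: fail over
during = pair.active()
pair.report(healthy=True)   # primary recovers: fail back
after = pair.active()
print(before, during, after)
```

The final step, returning traffic to the recovered primary, corresponds to the failback capability that high-availability software typically provides.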
As noted earlier, high availability and disaster recovery are both important for business continuity. Together, they help organizations to build high levels of fault tolerance, which refers to a system's ability to keep operating without interruption even if multiple hardware or software components fail.
Fault tolerance aims for zero downtime, while high availability is focused on delivering minimal downtime. A high-availability system designed to provide 99.999%, or five nines, operational uptime expects to see 5.26 minutes of downtime per year.
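The 5.26-minute figure follows directly from the availability percentage. The short calculation below reproduces it, assuming an average year of 365.25 days; the function name `downtime_minutes_per_year` is just a label for this example.

```python
MINUTES_PER_YEAR = 365.25 * 24 * 60  # assumes an average year of 365.25 days


def downtime_minutes_per_year(availability):
    """Expected yearly downtime, in minutes, for a given availability fraction."""
    return (1.0 - availability) * MINUTES_PER_YEAR


five_nines = downtime_minutes_per_year(0.99999)
three_nines = downtime_minutes_per_year(0.999)
print(f"99.999%: {five_nines:.2f} minutes per year")
print(f"99.9%:   {three_nines / 60:.2f} hours per year")
```

Each additional nine cuts the permitted downtime by a factor of ten, which is why moving from three nines to five nines is a significant engineering step.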
Unlike high availability, fault tolerance does not prioritize delivering high-quality performance. The purpose of fault-tolerance design in IT infrastructure is to prevent a mission-critical application from experiencing downtime.
Fault tolerance is a more expensive approach to ensuring uptime than high availability because it can involve backing up entire hardware and software systems and power supplies. High-availability systems, by contrast, typically do not require full replication of every physical component.
High availability and fault tolerance complement each other in that they help to support IT disaster recovery. Most business continuity strategies include high-availability, fault-tolerance, and disaster-recovery measures. These strategies help the organization maintain essential operations and support users when facing any type of critical IT failure, small or large.