Failure is an option in IT, including in public clouds — despite their resilient, carrier-class infrastructures. And many of the failures that regularly cause downtime are excluded from the cloud service providers’ money-back service level agreement guarantees. For this reason, the customer, specifically the IT operations staff, is responsible for ensuring uptime for all SQL Server applications running in hybrid and purely public cloud environments.
Implementing high availability and disaster recovery protections for SQL Server with the three major cloud service providers (Azure, Amazon and Google) requires some practical guidance, beginning with an important difference between HA and DR.
HA and DR provisions replicate data differently, and that difference can have important ramifications for how failovers are handled. The redundant resources (systems, software and data) for HA should be local on a Local Area Network, which facilitates synchronous data replication to a “hot” standby instance that is ready to take over immediately and automatically in the event of a failure.
By contrast, the redundancy required to recover from a widespread disaster must be “long distance” across a Wide Area Network. Because WAN latency would degrade the throughput of the active SQL Server instance under synchronous replication, DR configurations replicate data asynchronously. And because the resulting replication lag leaves only a “warm” standby instance, the best way to minimize data loss is with a manual failover procedure. Prioritizing the recovery point in this way introduces an unavoidable delay in the recovery time, but that tradeoff is generally considered acceptable given the rarity of widespread disasters.
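The latency argument above can be made concrete with some rough arithmetic. The sketch below assumes a deliberately simplified model in which a single session commits transactions one after another, and each synchronous commit waits for one network round trip on top of the local log write. The function name and the latency figures are hypothetical and chosen only for illustration; real throughput depends on concurrency, batching and the replication technology in use.

```python
def max_serial_commits_per_second(local_write_ms: float,
                                  round_trip_ms: float,
                                  synchronous: bool) -> float:
    """Upper bound on back-to-back commits per second for one session.

    Simplified model (an assumption, not a benchmark): a synchronous commit
    waits for the local log write plus one network round trip to the standby;
    an asynchronous commit waits for the local log write only.
    """
    if local_write_ms <= 0 or round_trip_ms < 0:
        raise ValueError("latencies must be positive")
    per_commit_ms = local_write_ms + (round_trip_ms if synchronous else 0.0)
    return 1000.0 / per_commit_ms


if __name__ == "__main__":
    # Hypothetical latencies: 1 ms local write, 1 ms between zones, 49 ms over a WAN.
    scenarios = [
        ("sync, between zones", 1.0, True),
        ("sync, across WAN", 49.0, True),
        ("async, across WAN", 49.0, False),
    ]
    for label, round_trip_ms, synchronous in scenarios:
        rate = max_serial_commits_per_second(1.0, round_trip_ms, synchronous)
        print(f"{label:20s} -> {rate:7.1f} commits/s")
```

Under these assumed numbers the synchronous WAN case is limited to a small fraction of the rate achievable between nearby zones, which is why DR configurations accept replication lag instead.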
All three major cloud service providers (CSPs) accommodate these differences with redundancies both within and across data centers. For HA, all have variously named “availability zones” that combine the synchronous replication available on a LAN with some geographical separation across the WAN. The zones within a region are linked by a low-latency, high-throughput network that connects two or more data centers to facilitate synchronous data replication. With latencies around one millisecond, the use of multi-zone configurations has become a best practice for HA.
All three major CSPs also have service level agreements (SLAs) that offer refunds when uptime falls below specified levels, usually ranging from 95.00% to 99.99%. All three require the use of availability zones to be eligible for the 99.99% uptime guarantee, which is generally accepted as constituting HA.
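To put those percentages in perspective, the following sketch converts an uptime level into the downtime it permits. The function name is hypothetical, and the yearly period is used only for intuition; the measurement period in an actual SLA is defined by each provider's terms, so the period is left as a parameter.

```python
# Downtime permitted by a given uptime percentage (illustrative arithmetic only).
MINUTES_PER_YEAR = 365 * 24 * 60  # 525,600 minutes, ignoring leap years


def allowed_downtime_minutes(uptime_percent: float,
                             period_minutes: float = MINUTES_PER_YEAR) -> float:
    """Return the minutes of downtime permitted in the period at this uptime level."""
    if not 0.0 <= uptime_percent <= 100.0:
        raise ValueError("uptime_percent must be between 0 and 100")
    return period_minutes * (100.0 - uptime_percent) / 100.0


if __name__ == "__main__":
    for level in (95.00, 99.00, 99.90, 99.99):
        minutes = allowed_downtime_minutes(level)
        print(f"{level:6.2f}% uptime -> {minutes:9.1f} min/year "
              f"({minutes / 60:6.1f} hours)")
```

At 99.99% the yearly budget is under an hour, while 95.00% permits more than 18 days, which shows how wide the quoted SLA range really is.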
Caveat Emptor: The SLAs only assure “dial tone” at the server level and exclude many common causes of downtime at the database and application levels. Explicitly excluded are natural disasters, the customer’s actions (or inactions), and customer-supplied system or application software. So while it is advantageous to leverage various aspects of a CSP’s infrastructure, additional provisions are needed to ensure adequate uptime for mission-critical SQL Server databases.
SQL Server’s “Always On” options
SQL Server offers two of its own HA/DR features: Always On Failover Cluster Instances and Always On Availability Groups. Failover Cluster Instances (FCIs) have three notable advantages: support in all versions since SQL Server 7; protection of the entire SQL Server instance; and perhaps most significantly, inclusion in the less expensive Standard Edition.
The problem is that FCIs are not readily supported on Linux, and on Windows they depend on Windows Server Failover Clustering (WSFC), which requires a storage area network (SAN) or another form of cluster-aware shared storage that is not available in the public cloud. Microsoft addressed this problem in Windows Server 2016 Datacenter Edition and SQL Server 2016 with the introduction of Storage Spaces Direct (S2D), but S2D has its own limitations, most notably an inability to span multiple availability zones, which makes it unsuitable for HA needs.
SQL Server’s other HA/DR feature, Always On Availability Groups, is supported in SQL Server 2012 and later for Windows and in SQL Server 2017 for Linux. This is SQL Server’s more robust HA/DR offering: it works in a cloud environment and supports readable secondaries (with appropriate licensing). But it does not protect the entire SQL Server instance, and for Windows it requires licensing the substantially more expensive Enterprise Edition, making it cost-prohibitive for many applications.
It is worth noting that a Basic Availability Groups feature was added to SQL Server 2016, but it supports only a single database per Availability Group, making it viable only for the smallest of environments.
SANless failover clustering
A notable disadvantage of application-specific options like Always On Availability Groups is that administrators need to use other HA and/or DR solutions for all non-SQL Server applications. Having multiple solutions inevitably increases both initial and ongoing costs, which is why many organizations prefer using application-agnostic third-party solutions that are purpose-built to provide HA/DR protections for Windows and Linux in private, public and hybrid cloud environments.
These third-party SANless failover clustering solutions are implemented entirely in software that creates, as the name implies, a cluster of servers (with their locally attached storage) and affords automatic failover. All such solutions include real-time data replication, continuous monitoring for detecting failures at the application level, and configurable policies for failover and failback. Most also offer a variety of value-added capabilities to simplify implementation and operation, and some even have special provisions for different versions and editions of SQL Server.
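The idea of a configurable failover policy driven by continuous monitoring can be sketched as follows. This is a minimal illustration and not the logic of any particular product: the `FailoverPolicy` class, its `failure_threshold` field and the `first_failover_index` function are hypothetical names, and the health-check results are supplied as a plain list rather than gathered from a live application.

```python
from dataclasses import dataclass
from typing import Iterable, Optional


@dataclass
class FailoverPolicy:
    # Consecutive failed health checks required before failing over.
    # The default of 3 is an arbitrary, hypothetical value.
    failure_threshold: int = 3


def first_failover_index(check_results: Iterable[bool],
                         policy: FailoverPolicy) -> Optional[int]:
    """Return the index of the health check that triggers failover, or None.

    Each item in check_results is True for a healthy check and False for a
    failed one. A healthy check resets the failure count, so brief glitches
    shorter than the threshold do not cause a failover.
    """
    if policy.failure_threshold < 1:
        raise ValueError("failure_threshold must be at least 1")
    consecutive_failures = 0
    for index, healthy in enumerate(check_results):
        consecutive_failures = 0 if healthy else consecutive_failures + 1
        if consecutive_failures >= policy.failure_threshold:
            return index
    return None


if __name__ == "__main__":
    # A short glitch (two failures) followed by a sustained outage.
    history = [True, False, False, True, False, False, False]
    print(first_failover_index(history, FailoverPolicy(failure_threshold=3)))
```

Raising the threshold makes the cluster more tolerant of transient errors at the cost of a slower reaction to a real outage, which is the kind of tradeoff such policies let administrators tune.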
SANless failover clustering makes it possible to use SQL Server’s FCIs with WSFC, or SQL Server 2017 for Linux, in configurations that span multiple availability zones. The resulting HA protection is capable of delivering an uptime of 99.99% for the database, regardless of the cause of the failure. DR protection can either be added to the HA cluster or be provided separately using the CSP’s DR provisions, including DR-as-a-Service (DRaaS). SANless failover clustering solutions are proven in practice to be dependable, and are easy to implement and operate. For these reasons, all three CSPs recommend their use for customers running mission-critical applications in their clouds.