Ensuring high availability and disaster recovery for SQL Server in the cloud

Published: October 21st, 2019

Failure is an option in IT, including in public clouds — despite their resilient, carrier-class infrastructures. And many of the failures that regularly cause downtime are excluded from the cloud service providers’ money-back service level agreement guarantees. For this reason, the customer, specifically the IT operations staff, is responsible for ensuring uptime for all SQL Server applications running in hybrid and purely public cloud environments.

Implementing high availability and disaster recovery protections for SQL Server with the three major cloud service providers (Azure, Amazon and Google) requires some practical guidance, beginning with an important difference between HA and DR.

HA and DR provisions replicate data differently, and that difference has important ramifications for how failovers are handled. The redundant resources (systems, software and data) for HA should be local, on a local area network (LAN), which facilitates synchronous data replication to a “hot” standby instance that is ready to take over immediately and automatically in the event of a failure.

By contrast, the redundancy required to recover from a widespread disaster must be “long distance” across a wide area network (WAN). Because the latency inherent in the WAN would, with synchronous replication, adversely affect SQL Server’s throughput in the active instance, data is replicated asynchronously in DR configurations. And because the resulting replication lag leaves only a “warm” standby instance, the best way to minimize data loss is with a manual failover procedure. Emphasizing the recovery point introduces an unavoidable delay in the recovery time, but such a tradeoff is generally considered acceptable given the rarity of widespread disasters.
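The tradeoff between the two replication modes can be illustrated with a rough back-of-the-envelope model. The sketch below is not taken from SQL Server itself; the function names and the latency and write-rate figures are assumptions chosen only for illustration. It treats a synchronous commit as waiting for the local write plus one network round trip, and treats the data at risk under asynchronous replication as the write rate multiplied by the replication lag.

```python
def max_serial_commits_per_second(local_write_ms: float,
                                  round_trip_ms: float,
                                  synchronous: bool) -> float:
    """Upper bound on back-to-back commits from one session (toy model).

    A synchronous commit waits for the standby to acknowledge, so the
    network round trip is added to every commit. An asynchronous commit
    returns as soon as the local write completes.
    """
    commit_ms = local_write_ms + (round_trip_ms if synchronous else 0.0)
    return 1000.0 / commit_ms


def potential_data_loss_mb(write_rate_mb_per_s: float,
                           replication_lag_s: float) -> float:
    """Data not yet on the standby if the primary is lost right now."""
    return write_rate_mb_per_s * replication_lag_s


if __name__ == "__main__":
    # All figures below are illustrative assumptions, not measurements.
    lan = max_serial_commits_per_second(1.0, 1.0, synchronous=True)
    wan = max_serial_commits_per_second(1.0, 40.0, synchronous=True)
    print(f"sync over LAN (1 ms round trip):  {lan:.0f} commits/s")
    print(f"sync over WAN (40 ms round trip): {wan:.1f} commits/s")
    print(f"async, 2 s lag at 5 MB/s: {potential_data_loss_mb(5.0, 2.0):.0f} MB at risk")
```

Under these assumed numbers, the WAN round trip dominates every synchronous commit, which is why DR configurations accept some replication lag instead.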

All three major cloud service providers (CSPs) accommodate these differences with redundancies both within and across data centers. For HA, all have variously named “availability zones” that combine the synchronous replication available on a LAN with some geographical separation across the WAN. Each zone has a low-latency, high-throughput network connecting two or more regional data centers to facilitate synchronous data replication. With latencies around one millisecond, the use of multi-zone configurations has become a best practice for HA.

All three major CSPs also have service level agreements (SLAs) that offer refunds when uptime falls below specified levels, usually ranging from 95.00% to 99.99%. All three require the use of availability zones to be eligible for the 99.99% uptime guarantee, which is generally accepted as constituting HA.
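To put those percentages in perspective, an uptime level can be converted into the downtime it permits. The following is a minimal sketch that assumes a 365-day year; actual SLAs typically measure uptime over a billing period and define downtime in their own terms, so the function name and the yearly basis here are illustrative assumptions.

```python
MINUTES_PER_YEAR = 365 * 24 * 60  # 525,600 minutes, assuming a 365-day year


def allowed_downtime_minutes(uptime_percent: float,
                             period_minutes: float = MINUTES_PER_YEAR) -> float:
    """Downtime permitted in the period at the given uptime percentage."""
    return period_minutes * (100.0 - uptime_percent) / 100.0


if __name__ == "__main__":
    for level in (95.00, 99.99):
        minutes = allowed_downtime_minutes(level)
        print(f"{level:.2f}% uptime allows {minutes:.2f} minutes "
              f"({minutes / (24 * 60):.2f} days) of downtime per year")
```

On that basis, 99.99% uptime leaves roughly 52.6 minutes of downtime per year, while 95.00% leaves 18.25 days.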

Caveat Emptor: The SLAs assure only “dial tone” at the server level and exclude many common causes of downtime at the database and application levels. Explicitly excluded are natural disasters, the customer’s actions (or inactions), and customer-supplied system or application software. So while it is advantageous to leverage various aspects of a CSP’s infrastructure, additional provisions are needed to ensure adequate uptime for mission-critical SQL Server databases.

SQL Server’s “Always On” options
SQL Server offers two of its own HA/DR features: Always On Failover Cluster Instances and Always On Availability Groups. Failover Cluster Instances (FCIs) have three notable advantages: support in all versions since SQL Server 7; protection of the entire SQL Server instance; and perhaps most significantly, inclusion in the less expensive Standard Edition.

The problem is that FCIs are not simply supported on Linux, and on Windows they depend on Windows Server Failover Clustering (WSFC), which requires a storage area network (SAN) or another form of cluster-aware shared storage that is not available in the public cloud. Microsoft addressed this problem in Windows Server 2016 Datacenter Edition and SQL Server 2016 with the introduction of Storage Spaces Direct (S2D), but S2D has its own limitations, most notably an inability to span multiple availability zones, which makes it unsuitable for HA needs.

SQL Server’s other HA/DR feature, Always On Availability Groups, is supported in SQL Server 2012 and later for Windows and in SQL Server 2017 for Linux. This is SQL Server’s more robust HA/DR offering: it works in a cloud environment and supports readable secondaries (with appropriate licensing). But it lacks protection for the entire SQL Server instance, and for Windows it requires licensing the substantially more expensive Enterprise Edition, making it cost-prohibitive for many applications.

It is worth noting that a Basic Availability Groups feature was added to SQL Server 2016, but it supports only a single database per Availability Group, making it viable only for the smallest of environments.

SANless failover clustering
A notable disadvantage of application-specific options like Always On Availability Groups is that administrators need to use other HA and/or DR solutions for all non-SQL Server applications. Having multiple solutions inevitably increases both initial and ongoing costs, which is why many organizations prefer using application-agnostic third-party solutions that are purpose-built to provide HA/DR protections for Windows and Linux in private, public and hybrid cloud environments.

These third-party SANless failover clustering solutions are implemented entirely in software that creates, as the name implies, a cluster of servers (with their locally attached storage) and affords automatic failover. All such solutions include real-time data replication, continuous monitoring for detecting failures at the application level, and configurable policies for failover and failback. Most also offer a variety of value-added capabilities to simplify implementation and operation, and some even have special provisions for different versions and editions of SQL Server.
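The idea of a configurable failover policy driven by continuous monitoring can be sketched in a few lines. This is a hypothetical illustration, not the interface of any particular clustering product: the `FailoverPolicy` class, its `failure_threshold` setting, and the `should_fail_over` function are all assumed names. The sketch triggers a failover only after a configured number of consecutive failed health checks, so that a single transient failure does not cause an unnecessary switch to the standby.

```python
from dataclasses import dataclass
from typing import Iterable


@dataclass
class FailoverPolicy:
    """Hypothetical policy: how many consecutive failed checks to tolerate."""
    failure_threshold: int = 3


def should_fail_over(check_results: Iterable[bool],
                     policy: FailoverPolicy) -> bool:
    """Return True once failures reach the threshold without a recovery.

    Each item in check_results is one application-level health check:
    True means healthy, False means the check failed.
    """
    consecutive_failures = 0
    for healthy in check_results:
        # A healthy check resets the streak; a failed one extends it.
        consecutive_failures = 0 if healthy else consecutive_failures + 1
        if consecutive_failures >= policy.failure_threshold:
            return True
    return False


if __name__ == "__main__":
    policy = FailoverPolicy(failure_threshold=3)
    transient = [True, False, False, True, False]
    sustained = [True, False, False, False]
    print("transient glitch triggers failover:", should_fail_over(transient, policy))
    print("sustained outage triggers failover:", should_fail_over(sustained, policy))
```

A higher threshold reduces false failovers at the cost of a slower reaction to a genuine outage, which is the kind of tradeoff such policies let administrators tune.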

SANless failover clustering makes it possible to use SQL Server’s FCIs with WSFC or SQL Server 2017 for Linux in configurations that span multiple availability zones for HA protection capable of delivering an uptime of 99.99% for the database, regardless of the cause of the failure. DR protection can either be added to the HA cluster or be provided separately using the CSP’s DR provisions, including DR-as-a-Service or DRaaS. SANless failover clustering solutions are proven in practice to be dependable, and are easy to implement and operate. And it is for these reasons that all three CSPs recommend their use for customers running mission-critical applications in their clouds.

Article Tags

disaster recovery, it operations, IT resiliency

About David Bermingham

David Bermingham is technical evangelist at SIOS Technology.
