Cloud adoption is at an all-time high, but massive outages, such as last week's AWS outage, have highlighted that there are real tradeoffs in moving to the cloud, chief among them that applications become unavailable when their cloud provider experiences an outage.
Even though systems are designed for reliability — especially at the major cloud providers — no system is perfect, and failures are bound to happen at some point.
“One of the problems is when there’s a failure, it just affects so many people in so many organizations, because so many things are using that one system,” said Chris Gladwin, CEO and founder of data analytics company Ocient. “So that’s one of the challenges. The total number of failures that occur as a result of the cloud technology really being robust, that’s gotten better and it’s less common that they fail. The issue, though, is when they fail, the implications are planet wide, like we saw.”
Another tradeoff has to do with the cost of the cloud at scale. According to Gladwin, when a new application is in its initial growth phase, it makes sense to use cloud resources, because their elasticity and on-demand provisioning maximize efficiency while the application is scaling up. But once an application reaches hyperscale, running it in the cloud becomes significantly more expensive.
The tricky thing, according to Gladwin, is that growing to hyperscale doesn’t happen at one moment in time.
“If you’re going to spend a lot of time in that growth mode, cloud-based deployments are really valuable because they’re much quicker to provision, much more dynamic, and you can take advantage of the elastic properties,” said Gladwin. “When you’re growing, growing, growing, what really matters is how quickly you can double resources, how quickly you can double resources again … So during that growth phase, starting from zero, that flexibility really is important.”
Andreessen Horowitz released a report earlier this year that showed this exact phenomenon. It found that cloud delivers on its promise of cost-savings early in an application’s journey, but as the company grows, “the pressure it puts on margins can start to outweigh the benefits.”
The report noted that since this shift happens later in a company's life, it can be difficult to make changes after years of development effort have been spent on new features and almost no attention has been paid to infrastructure optimization. As awareness of this phenomenon increases, companies have begun repatriating their workloads from the cloud or adopting a hybrid approach. According to Andreessen Horowitz, Dropbox repatriated its workloads from the cloud and saved $75 million over two years as a result.
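The crossover Gladwin and the Andreessen Horowitz report describe is simple break-even arithmetic: cloud cost scales roughly linearly with usage, while owned infrastructure has a high fixed cost but a lower marginal cost. The sketch below illustrates this with entirely hypothetical prices (the function names and figures are illustrative assumptions, not real provider pricing):

```python
# Illustrative break-even model for cloud vs. on-prem cost.
# All prices are hypothetical placeholders, not real provider rates.

def monthly_cloud_cost(units: float, unit_price: float = 100.0) -> float:
    """Pay-as-you-go: cost scales linearly with usage."""
    return units * unit_price

def monthly_onprem_cost(units: float, fixed: float = 50_000.0,
                        unit_price: float = 40.0) -> float:
    """Owned infrastructure: large fixed cost, lower marginal cost."""
    return fixed + units * unit_price

def crossover_units(fixed: float = 50_000.0, cloud_price: float = 100.0,
                    onprem_price: float = 40.0) -> float:
    """Usage level above which on-prem becomes cheaper than cloud."""
    return fixed / (cloud_price - onprem_price)

if __name__ == "__main__":
    print(f"On-prem wins above ~{crossover_units():.0f} units/month")
    for units in (100, 1000, 2000):
        print(units, monthly_cloud_cost(units), monthly_onprem_cost(units))
```

Below the crossover point the cloud's lack of fixed cost wins; above it, the lower marginal cost of owned hardware dominates, which is why the pressure appears only once a company is large.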
Gladwin sees a lot of interest in hybrid cloud deployments because companies want the best of both worlds. Public cloud can provide management advantages around security and provisioning, while an on-premises deployment or managed service can provide cost efficiency for hyperscale applications.
Going back to the issue of reliability, that too is ultimately a cost challenge. Put simply, greater reliability costs more money. Companies need to assess what level of reliability is acceptable to them. If a cloud provider's standard reliability meets those requirements, fine; if not, the costs start to rise.
For example, companies that need higher reliability might adopt a multi-cloud approach, deploying to multiple clouds so that they have a backup if one of them fails.
“So, you know, the way that works is you always have to look at things like, what is it you want to protect against? Is it a network outage? Is that a cloud outage? Then you have to design and implement a solution that can tolerate whatever level of failure, like I want to be able to tolerate three different network providers going out. Well, that means you have to have four independently connected. So you have to define what your reliability requirements are and then design around them. And the reason why people don’t design things with 100% reliability no matter what happens is it is more expensive. And in some cases, you need to pay that price. But you always want to adjust the level of reliability to just above what you need. Because you only want to spend the money that you have to,” Gladwin added.
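The arithmetic behind Gladwin's example is straightforward: tolerating k simultaneous provider outages requires k+1 independent providers, and, if failures really are independent, the probability that every provider is down at once shrinks multiplicatively. A minimal sketch of both calculations (the function names are illustrative, and the independence assumption is the key caveat — correlated failures are exactly what planet-wide outages expose):

```python
# Redundancy arithmetic behind "tolerate three outages, connect four".
# Assumes provider failures are statistically independent, which real
# correlated outages can violate.

from math import prod

def providers_needed(failures_to_tolerate: int) -> int:
    """Tolerating k concurrent outages requires k + 1 independent providers."""
    return failures_to_tolerate + 1

def combined_availability(availabilities: list[float]) -> float:
    """Probability that at least one provider is up at any moment."""
    return 1.0 - prod(1.0 - a for a in availabilities)

if __name__ == "__main__":
    print(providers_needed(3))                      # Gladwin's example: 4
    # Three independent providers at 99.9% each:
    print(combined_availability([0.999, 0.999, 0.999]))
</ ```

This is also why teams aim for "just above what you need": each extra nine of availability roughly multiplies the redundancy you must pay for, while the incremental benefit keeps shrinking.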