Google coined the term Site Reliability Engineering (SRE) in response to its own production growth and challenges.
“SRE is what you get when you treat operations as if it’s a software problem. Our mission is to protect, provide for, and progress the software and systems behind all of Google’s public services — Google Search, Ads, Gmail, Android, YouTube, and App Engine, to name just a few — with an ever-watchful eye on their availability, latency, performance, and capacity,” the company wrote on its website.
Since then, the role of site reliability engineer has taken hold, and more businesses are adding SREs to their teams. A recent report from Catchpoint found more than 1,000 job listings for SREs posted on LinkedIn.
Developed independently of SRE was DevOps, a community effort to get developers to work with operations and understand how their code runs in production. “They would throw this code over the proverbial wall to the operations team, which would be responsible for keeping the applications up and running. This often resulted in tension between the two groups, as each group’s priorities were misaligned with the needs of the business. DevOps emerged as a culture and a set of practices that aims to reduce the gaps between software development and software operation,” Google’s developer advocate Seth Vargo and SRE Liz Fong-Jones wrote in a post.
Because of their similarities, many people in the industry are confused about what it means to do SRE and what it means to do DevOps. Dave Rensin, Google’s director of customer reliability engineering and network capacity, spoke at this week’s O’Reilly Velocity conference to explain where SRE fits within DevOps, and his answer was that the two should be best friends.
According to Rensin, DevOps and SRE reinforce each other because they share many of the same principles. However, Rensin stated that SRE is an “opinionated concrete implementation of DevOps principles.”
Some differences, according to Google, include:
- While DevOps aims to reduce organizational silos, SRE provides ownership by giving developers and operations the same tools and techniques to work with
- While DevOps says failure should be accepted, SRE believes in having a formula for dealing with those failures
- While DevOps promotes gradual change, SRE encourages moving quickly to reduce the cost of failure
- While DevOps leverages tools and automation, SRE aims to minimize manual systems and focus on long-term value efforts
- While DevOps tells us to measure everything, SRE defines clear ways to measure things like availability, uptime and outages
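
To make the last point concrete, here is a minimal sketch of what a "clear way to measure" availability could look like. It is an illustration only, not Google's formula: the function names, the request-based definition, and the 30-day window are all assumptions chosen for the example.

```python
def availability(successful_requests: int, total_requests: int) -> float:
    """Request-based availability: the fraction of requests served successfully.

    Assumption for this sketch: a request either succeeds or fails.
    """
    if total_requests <= 0:
        raise ValueError("total_requests must be positive")
    return successful_requests / total_requests


def allowed_downtime_minutes(target: float, window_days: int = 30) -> float:
    """Minutes of downtime a time-based availability target permits.

    `target` is a fraction such as 0.999; the 30-day default is an assumption.
    """
    if not 0.0 <= target <= 1.0:
        raise ValueError("target must be between 0 and 1")
    return (1.0 - target) * window_days * 24 * 60


if __name__ == "__main__":
    # 999 successes out of 1,000 requests
    print(availability(999, 1000))
    # Downtime permitted by a 99.9% target over 30 days, rounded for display
    print(round(allowed_downtime_minutes(0.999), 1))
```

Under these assumptions, a 99.9 percent target over 30 days leaves roughly 43 minutes of permitted downtime, which turns "measure everything" into a specific number a team can track outages against.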
“If you think of DevOps like an interface in a programming language, class SRE implements DevOps. While the SRE program did not explicitly set out to satisfy the DevOps interface, both disciplines independently arrived at a similar set of conclusions. But just like in programming, classes often include more behavior than just what their interface defines, or they might implement multiple interfaces. SRE includes additional practices and recommendations that are not necessarily part of the DevOps interface,” wrote Vargo and Fong-Jones. “DevOps and SRE are not two competing methods for software development and operations, but rather close friends designed to break down organizational barriers to deliver better software faster.”
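The interface analogy can be sketched in code. The example below is only an illustration of the metaphor: the method names paraphrase the five points listed above, and `additional_practices` is a hypothetical placeholder for the extra behavior the quote mentions without naming.

```python
from abc import ABC, abstractmethod


class DevOps(ABC):
    """The 'interface': states principles without prescribing how to meet them."""

    @abstractmethod
    def reduce_silos(self) -> str: ...

    @abstractmethod
    def accept_failure(self) -> str: ...

    @abstractmethod
    def change_gradually(self) -> str: ...

    @abstractmethod
    def automate(self) -> str: ...

    @abstractmethod
    def measure_everything(self) -> str: ...


class SRE(DevOps):
    """One concrete, opinionated implementation of the DevOps 'interface'."""

    def reduce_silos(self) -> str:
        return "give developers and operations the same tools and techniques"

    def accept_failure(self) -> str:
        return "have a formula for dealing with failures"

    def change_gradually(self) -> str:
        return "move quickly to reduce the cost of failure"

    def automate(self) -> str:
        return "minimize manual work and focus on long-term value"

    def measure_everything(self) -> str:
        return "define clear measures of availability, uptime and outages"

    def additional_practices(self) -> str:
        # Hypothetical placeholder: a class may do more than its interface requires.
        return "practices that are not part of the DevOps interface"


if __name__ == "__main__":
    sre = SRE()
    print(isinstance(sre, DevOps))
    print(sre.measure_everything())
```

As in the quote, the abstract class cannot be instantiated on its own, and any number of other classes could satisfy the same interface in different ways.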
The number one thing Rensin said teams should remember is that the most important feature in their systems is reliability. If a system is not reliable and does not meet user expectations, users will not trust it, and if they don’t trust it, they will not use it. “There is no such thing as a very valuable system with no users,” he said. “Reliability is the most important feature because it is the basis of user trust.”