Since Google released its Site Reliability Engineering (SRE) book in 2016, the field has gained widespread attention. However, adopting SRE as defined by Google is not as applicable to most organizations as it may seem, according to Sanjeev Sharma, a principal analyst of Accelerated Strategies, who spoke at Catchpoint’s “SRE from Home” virtual event last week.
When it first created SRE, Google had a team of software developers work in operations with the goal of developing software to handle the vast majority of tasks that were assigned to the system administration teams and incident response teams.
Sharma explained that instead of trying to replace Ops with site reliability engineers, most enterprises should be supplementing their Ops teams with them.
At Google, the operations teams have to handle incidents and outages on a constant basis. Because its data centers are fairly homogeneous and the same incidents keep recurring, Google can automate many of the responses. Most enterprises, on the other hand, shouldn’t replace their current Ops teams and sys admins with software engineers, because enterprises run their data on custom hardware that their Ops teams have expertise in, Sharma explained.
“What [Google] needs to do is have the ability to dynamically shift workloads around so that when one hardware component is being serviced, they can easily pull the workload which is being run without interruption or loss of quality of service to another part of the data center as desired,” Sharma said. “This by itself would disqualify most organizations because most companies are made up of hardware that is not commodity hardware and in most cases is custom hardware or generic hardware that has been optimized for the tasks being run.”
The idea of SRE is to have software developers working in the Ops team identify repetitive tasks and automate them so the actual Ops teams can focus on the outliers. Therefore, instead of replacing Ops, Sharma explained, organizations should be supplementing their Ops teams with SREs. He added that replacing Ops teams with SREs is the first antipattern because it gets rid of all of their existing data center expertise, which would be detrimental to the enterprise.
The repetitive tasks that are frequently automated include detecting and remediating outages and degradation, as well as quality-of-service efforts, according to Sharma.
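As a rough illustration of that division of labor, the sketch below routes recurring incident types to automated fixes and hands everything else to the Ops team. It is not taken from Sharma's talk: the incident names, the `RUNBOOK` table, and the handler functions are all hypothetical, and the handlers only return a description of what a real remediation would do.

```python
from typing import Callable, Dict, List, Tuple

# A remediation takes the affected target and returns a short log message.
Remediation = Callable[[str], str]


def restart_service(target: str) -> str:
    # Hypothetical stand-in for a real restart command.
    return f"restarted {target}"


def clear_temp_files(target: str) -> str:
    # Hypothetical stand-in for a real disk cleanup job.
    return f"cleared temp files on {target}"


# Known, repetitive incident types mapped to their automated fix.
RUNBOOK: Dict[str, Remediation] = {
    "service_down": restart_service,
    "disk_full": clear_temp_files,
}

Incident = Tuple[str, str]  # (incident kind, affected target)


def triage(incidents: List[Incident]) -> Tuple[List[str], List[Incident]]:
    """Apply automated fixes to known incidents; escalate the outliers to Ops."""
    automated: List[str] = []
    escalated: List[Incident] = []
    for kind, target in incidents:
        handler = RUNBOOK.get(kind)
        if handler is None:
            # No automation exists: this needs the Ops team's expertise.
            escalated.append((kind, target))
        else:
            automated.append(handler(target))
    return automated, escalated


if __name__ == "__main__":
    done, for_ops = triage([("disk_full", "web-1"), ("raid_degraded", "db-2")])
    print(done)
    print(for_ops)
```

In this sketch the runbook grows as SREs automate more recurring cases, while anything unrecognized still lands with the people who know the hardware.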
“You still need to make your services and systems more reliable so maybe you change your definition from site reliability engineering to service reliability engineering, but you don’t need to do it the way Google does,” Sharma said.
A second antipattern that is even more common is that organizations are taking their DevOps team and renaming it as the SRE team.
“First and foremost, you shouldn’t have a DevOps team. There’s no such thing as a DevOps team! DevOps is something everyone does. They have a different role in how DevOps is adopted and the way the tasks are performed but there shouldn’t be a new silo called DevOps team who is the intermediary between all of the stakeholders in your application delivery pipeline,” Sharma said, adding that if an organization has a DevOps team, it should be called DevOps coaching instead.
The third antipattern Sharma found is trying to adopt SRE principles by the book without first changing the culture.
To handle this, organizations must first establish a reliability culture, which includes setting the right service-level objectives and error budgets, using incident postmortems to understand why something went down, and hiring software engineers in Ops, according to Sharma.
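For readers unfamiliar with how a service-level objective translates into an error budget, here is a minimal sketch of the arithmetic, assuming a request-based SLO. The function names and the sample figures are illustrative assumptions, not numbers from Sharma's talk.

```python
def error_budget(slo_target: float, total_requests: int) -> int:
    """Failed requests the SLO tolerates over the measurement window."""
    if not 0.0 < slo_target <= 1.0:
        raise ValueError("slo_target must be in (0, 1]")
    if total_requests < 0:
        raise ValueError("total_requests must be non-negative")
    # round() guards against floating-point noise in (1 - slo_target).
    return round((1.0 - slo_target) * total_requests)


def budget_remaining(slo_target: float, total_requests: int, failed: int) -> float:
    """Fraction of the error budget still unspent (negative if overspent)."""
    allowed = error_budget(slo_target, total_requests)
    if allowed == 0:
        # A 100% target (or an empty window) leaves no room for failure.
        return 1.0 if failed == 0 else 0.0
    return 1.0 - failed / allowed


if __name__ == "__main__":
    # Illustrative numbers: a 99.9% SLO over one million requests.
    print(error_budget(0.999, 1_000_000))
    print(budget_remaining(0.999, 1_000_000, 250))
```

A team might use the remaining fraction to decide whether to keep shipping changes or to pause and spend time on reliability work.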
“Adapt the SRE practices for your needs and make an enterprise-wide effort to change the culture and become a culture which focuses on systems and application reliability as everybody’s responsibility,” Sharma said. “A modern system is a constantly changing melange of hardware and software in a variable world. That’s why chaos engineering is very important because you can’t understand how a system behaves without interacting with it and without testing its boundaries and getting ready for outliers. And that is what at the end of the day reliability engineering is all about.”