On-call is broken. It’s not a little broken. It’s really broken. We all take the current on-call experience for granted. It’s the way it’s always been. Most of us never stop and think about how expensive, manual, and error-prone on-call really is. That is, until a major outage sets your company back and damages trust with your customers.
If you’re like most developers, you hate being on call. No one wants to be awakened in the middle of the night to resize a disk! But it goes way beyond that. It’s one thing to support your own code; it’s another to diagnose problems in code you didn’t write. All too often, there is no runbook for the issue you’re being asked to diagnose, and if there is one, it’s long, complicated, and out of date. And, of course, there is no glory in debugging an issue. Developers get recognized for innovation, whether that’s a new feature or some new automation, not for fixing what’s already in production.
The fact that on-call incident response is broken costs you time, toil, and money. It can also cost you customers, damage your company’s reputation, and block potential innovation.
On August 25, 2021, Amazon’s 13 minutes of downtime translated to almost $5 million in lost revenue.
On October 4, 2021, Facebook (and its subsidiaries Messenger, Instagram, and WhatsApp) was globally unavailable for six hours because of a network configuration error. Estimated lost revenue: $100 million.
Additional outages have been reported at Verizon, Microsoft, and AWS, affecting millions of users and costing millions of dollars.
And that is just the beginning. Engineering leaders often overlook the magnitude of the day-to-day costs of being on call. There are over 1.2 million site reliability engineers (SREs) and cloud operations engineers on LinkedIn. These are the engineers who work on improving software system reliability across a number of key areas, including incident response. The cost of these engineers is over $180 billion, more than the revenues of AWS, Azure, and Google Cloud Platform combined, and the incidents they are fixing lead to almost 1 billion hours of degraded service for customers. Businesses are constantly trying to hire more SREs. As demand for SREs reaches an all-time high, so does SRE burnout; the average SRE tenure is less than 18 months. Companies are hiring more people to play “whack-a-mole,” handling one issue as three more pop up. The result? Companies spend more time keeping the lights on than innovating, which puts them at a competitive disadvantage.
So, how did we get here?
Operations are more complex than ever before. Today’s production fleets are convoluted environments with a mixture of VMs and containers running across multiple clouds and multiple accounts, and each environment has its own nuances, credentials, and APIs. All of this makes on-call work tedious and automation even harder. On top of that, faster release cycles place an ever-increasing burden on the engineers responsible for systems in production.
Few companies have fully internalized and aggressively adopted an automation strategy for incidents, and this is a huge gap in the software development lifecycle. While testing, deployment, and configuration have been automated, the manual execution of tasks in production has become a bottleneck, with engineers addressing the same or similar tasks repeatedly. Companies do have observability and incident management tools in place that can shine some light on an issue and route it to the appropriate channel, but even with an automatically generated alert, a human is still required to diagnose and repair the issue by hand. This lack of effective automation within production operations means that downtime, errors, and toil just continue to grow.
What about simply enabling more people to run existing scripts? While a good first step, this is quite difficult with most of today’s tools — and it isn’t scalable. If you write a script that runs on one box, this is a straightforward task. However, determining where and when to run this script across thousands of boxes with the right credentials can be a daunting project.
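To make the scale problem concrete, here is a minimal sketch in Python of what “just run the script everywhere” starts to look like. Everything in it is an assumption: the inventory, the hostnames, and the check_disk.sh script are placeholders, and a real fleet adds per-cloud credentials, bastion hosts, retries, and audit trails, which is exactly why this turns into a daunting project.

```python
# Minimal sketch (illustrative only): fanning one script out across a host inventory.
import subprocess
from concurrent.futures import ThreadPoolExecutor

# Hypothetical inventory; real fleets span clouds, accounts, and credential sets.
INVENTORY = {
    "prod-us-east": ["web-001.example.com", "web-002.example.com"],
    "prod-eu-west": ["web-101.example.com"],
}

def run_on_host(host: str, script: str = "./check_disk.sh") -> tuple[str, int, str]:
    """Copy the script to a host over SSH, run it, and return (host, exit code, output)."""
    subprocess.run(["scp", script, f"{host}:/tmp/check.sh"], check=True, timeout=30)
    result = subprocess.run(
        ["ssh", host, "bash /tmp/check.sh"],
        capture_output=True, text=True, timeout=60,
    )
    return host, result.returncode, result.stdout

if __name__ == "__main__":
    hosts = [h for group in INVENTORY.values() for h in group]
    with ThreadPoolExecutor(max_workers=20) as pool:
        for host, code, output in pool.map(run_on_host, hosts):
            print(f"{host}: exit={code}\n{output}")
```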
When it comes to debugging and repair, the engineer must log into box after box when an alarm sounds, first to diagnose and then to fix the problem. Since automation itself can be time-consuming, on-call teams automate away only a tiny fraction of the issues they deal with on a day-to-day basis: with such a huge array of on-call incidents, the effort to automate any one of them is often deemed too much. On top of that, the industry is reinventing the wheel over and over. Every company experiences full disks, memory leaks, and networking issues, and each one is figuring out how to debug these issues even though thousands of companies have done it before.
The missing link
Production operations and on-call lack a critical third pillar: incident automation. People have far more tools to find problems than to actually fix them. Companies have addressed observability (monitoring and detection) and incident management (which assigns and prioritizes incidents), but few are focused on incident automation, which covers automated diagnosis and repair. Too much of on-call and incident repair is manual, and the lack of automation, or even partial automation, in this key aspect of production ops is costing the industry dearly.
In the production ops world, even a 0.1 percent human error rate can lead to a major outage down the line. And reliance on runbooks (detailed guides for completing a commonly repeated task within the IT operations process) isn’t working: these documents and wikis are often ignored. Meanwhile, employee skill sets vary from beginner to advanced, and institutional knowledge is continually walking out the door due to high turnover. The knowledge lives in engineers’ heads, not in the runbooks, and when those engineers leave, so does valuable information.
Now is the time to automate production ops. Companies should work to automate away repetitive incidents in production, including expired certificates, disk failures, stuck pods, and JVM memory leaks. This is particularly important since production operations is a 24×7 function. Automation reduces errors and IT fatigue, and increases the time available for higher-value work. Yet companies have been reluctant to automate, citing how complex and time-consuming it is.
How do you start your path to production ops automation?
Automation can be an intimidating process. While you can’t automate everything, smart automation will simplify and streamline your production ops. Here are some guidelines:
- Crawl before you walk, and walk before you run. Start tracking and categorizing your tickets so that you can truly understand both the impact on customers and the engineering cost for each issue.
- When you ticket or track your issues, be sure you can measure how many hours it takes to address each issue. This will allow you to prioritize where to invest in automation (the first sketch after this list shows one way to rank issues by engineering hours).
- Standardize debugging practices. At most companies, there is a short list of five to seven diagnostics that your best engineers run almost every time they debug an issue. Automate the collection of these diagnostics (see the second sketch after this list).
- Build precise alarms mapped to specific issues. This is an overlooked but critical step for automation. If your alarm is too generic, you can’t tell what caused the issue, and you can’t automate the repair.
- Then build “human-in-the-loop” automations to repair issues (see the third sketch after this list). This ensures that you still have human oversight while dramatically improving mean time to repair, and it lets you empower a much broader team to repair many common incidents.
- Ensure your human-in-the-loop automations include post-repair diagnostics so that you can confirm that your automation is actually what fixed the problem.
- Once your team has seen and fixed the same issue multiple times with the same approach, you should be ready for full automation. No matter how you choose to automate, treat your automations just like any other production code and integrate their deployment into your standard CI/CD process.
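To make these guidelines concrete, here are three minimal Python sketches. They are illustrative rather than prescriptive: the ticket fields, commands, thresholds, and repair actions are all assumptions to be replaced with your own. The first sketch ranks incident categories by total engineering hours, the prioritization step described above; in practice the data would come from an export of your ticketing system.

```python
# Sketch 1 (illustrative): rank incident categories by total engineering hours.
from collections import defaultdict

# Hypothetical export from a ticketing system: one record per resolved incident.
tickets = [
    {"category": "disk-full",       "hours": 1.5},
    {"category": "jvm-memory-leak", "hours": 4.0},
    {"category": "disk-full",       "hours": 2.0},
    {"category": "expired-cert",    "hours": 3.0},
    {"category": "disk-full",       "hours": 1.0},
]

totals = defaultdict(float)
counts = defaultdict(int)
for ticket in tickets:
    totals[ticket["category"]] += ticket["hours"]
    counts[ticket["category"]] += 1

# The categories at the top of this list are your best automation candidates.
for category, hours in sorted(totals.items(), key=lambda item: item[1], reverse=True):
    print(f"{category}: {counts[category]} tickets, {hours:.1f} engineer-hours")
```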
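The second sketch standardizes first-pass debugging by collecting a fixed set of diagnostics in one shot. The command list here is an assumption; substitute the handful of commands your most experienced engineers reach for first, and attach the output to the ticket so every responder starts from the same facts.

```python
# Sketch 2 (illustrative): collect a standard set of first-pass diagnostics.
import json
import subprocess
from datetime import datetime, timezone

# Assumed "top diagnostics"; replace with the commands your team actually relies on.
DIAGNOSTICS = {
    "disk_usage":    ["df", "-h"],
    "memory":        ["free", "-m"],
    "top_processes": ["ps", "aux", "--sort=-%mem"],
    "kernel_errors": ["dmesg", "--level=err,warn"],
}

def collect() -> dict:
    """Run every diagnostic command and bundle the output into one report."""
    report = {"collected_at": datetime.now(timezone.utc).isoformat()}
    for name, command in DIAGNOSTICS.items():
        result = subprocess.run(command, capture_output=True, text=True, timeout=30)
        report[name] = result.stdout
    return report

if __name__ == "__main__":
    # Print as JSON so the report can be attached to the ticket or an alert.
    print(json.dumps(collect(), indent=2))
```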
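The third sketch is a human-in-the-loop repair for one specific alarm, a nearly full disk, with a post-repair check to confirm the fix. The mount point, threshold, and cleanup command are assumptions; the pattern (diagnose, ask an operator to approve, repair, re-verify) is the point.

```python
# Sketch 3 (illustrative): human-in-the-loop repair for a full disk, with a post-repair check.
import shutil
import subprocess

MOUNT = "/var"       # assumed mount point tied to a specific "disk nearly full" alarm
THRESHOLD = 0.90     # the alarm fires above 90% usage

def usage(path: str) -> float:
    """Return the fraction of the filesystem at `path` that is in use."""
    total, used, _free = shutil.disk_usage(path)
    return used / total

def main() -> None:
    before = usage(MOUNT)
    print(f"{MOUNT} is {before:.0%} full (threshold {THRESHOLD:.0%}).")
    if before < THRESHOLD:
        print("Below threshold; nothing to do.")
        return

    # Human in the loop: an operator approves the specific repair action.
    answer = input("Vacuum journald logs down to 500M? [y/N] ")
    if answer.strip().lower() != "y":
        print("Skipped; no changes made.")
        return

    subprocess.run(["journalctl", "--vacuum-size=500M"], check=True)

    # Post-repair diagnostics: confirm the repair actually fixed the problem.
    after = usage(MOUNT)
    print(f"{MOUNT} is now {after:.0%} full.")
    print("Repair confirmed." if after < THRESHOLD else "Still above threshold; escalate.")

if __name__ == "__main__":
    main()
```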
The time to invest in automation is now. The current manual strategy is doomed. Every year, the number of incidents increases exponentially. Few companies are fixing tomorrow’s issues today, and automation keeps you ahead of the curve as your team spends more time innovating (and less time debugging).