Outages can be devastating to digital businesses: if your app isn’t working, or if your site is down, you’re losing revenue by the minute. And in the modern world of high-velocity application development, outages are a matter of when, not if. However, many companies don’t spend the time and resources necessary to prepare for this inevitability, leading to panicked, disorganized, and ineffective incident response. Establishing best practices and investing in the necessary tools not only helps resolve each incident as quickly as possible, but also creates the opportunity to learn from incidents and be more resilient going forward. So what, exactly, does an efficient and effective incident management process look like? What steps must be taken, how should teams collaborate, and what data should be brought to bear?
Incident management is crucial, though often cumbersome
Engineering teams rely on many tools and datasets to respond to incidents, from metrics to logs to application traces, as well as chat, messaging, and video tools for communication. But a structured incident management process is the glue that holds it all together, combining alerting, collaboration, and documentation in one place. Many teams rely on complex processes and specialized, siloed knowledge, making it harder to align on what needs to be done. An effective incident management workflow should be established when systems are healthy, making clear what info is needed, who is responsible for managing the response, and how to memorialize the incident for future learning. This requires accessible data, well-understood roles and responsibilities, and clearly defined channels of communication — all planned and documented ahead of time — so managing the issue doesn’t interfere with resolving the issue.
How you alert is as important as what you alert on
A triggered alert is generally the start of an incident management workflow, so teams need to be thoughtful about what constitutes alert-worthy data. But the “who” and “when” are just as important as the “what”: the people who are alerted, and when those alerts are escalated, matter as much as the content of the alert. Proper incident response makes being on-call as easy as possible, ensuring that the right people get alerted, with the right information, so they can work together from a shared set of data. This means making the alert, and its accompanying charts and graphs, accessible within collaborative tools, and automating the workflow by which the alert gets sent to the people responsible for handling it.
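To make the “who” and “when” concrete, here is a minimal sketch of an escalation policy in Python. The service names, responder names, and time thresholds are all hypothetical, and a real alerting tool will have its own configuration format; the point is only that routing and escalation can be written down ahead of time rather than decided mid-incident.

```python
from dataclasses import dataclass
from datetime import datetime, timedelta


@dataclass
class EscalationStep:
    responder: str      # hypothetical on-call contact
    after: timedelta    # time unacknowledged before this step kicks in


@dataclass
class Alert:
    service: str
    summary: str
    triggered_at: datetime
    acknowledged: bool = False


# Hypothetical routing table: service name -> ordered escalation steps.
ESCALATION_POLICIES = {
    "checkout": [
        EscalationStep("primary-oncall", timedelta(minutes=0)),
        EscalationStep("secondary-oncall", timedelta(minutes=10)),
        EscalationStep("engineering-manager", timedelta(minutes=30)),
    ],
}
# Fallback for services without their own policy (also an assumption).
DEFAULT_POLICY = [EscalationStep("platform-oncall", timedelta(minutes=0))]


def responders_to_notify(alert: Alert, now: datetime) -> list[str]:
    """Return the responders whose escalation step has been reached by `now`.

    An acknowledged alert stops escalating, so it returns an empty list.
    """
    if alert.acknowledged:
        return []
    policy = ESCALATION_POLICIES.get(alert.service, DEFAULT_POLICY)
    elapsed = now - alert.triggered_at
    return [step.responder for step in policy if elapsed >= step.after]
```

With this policy, a “checkout” alert that stays unacknowledged for 15 minutes would reach both the primary and secondary on-call, while an alert for a service with no policy of its own falls back to the default responder.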
Unified workflows lead to better collaboration and faster resolution
Once the right people are alerted, and are communicating within their preferred messaging and communications tool, they’ll also need access to all the relevant data from both the current and historical incidents. Teams need the ability to sort incidents by key metadata, view a chronological list of updates contributing to the issue, and provide relevant commentary, context, and outcomes. Having a proven set of integrated tools that consolidate all the necessary data in one place will make this kind of collaboration easier and more fruitful.
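As a rough illustration of what “sort incidents by key metadata” and “a chronological list of updates” might look like in practice, the sketch below models an incident record in Python. The field names (severity, service, and so on) and the convention that 1 is the most severe level are assumptions made for the example, not a description of any particular tool.

```python
from dataclasses import dataclass, field
from datetime import datetime


@dataclass
class Update:
    at: datetime
    author: str
    note: str


@dataclass
class Incident:
    title: str
    severity: int            # assumption: 1 is the most severe
    service: str
    opened_at: datetime
    updates: list[Update] = field(default_factory=list)

    def add_update(self, at: datetime, author: str, note: str) -> None:
        self.updates.append(Update(at, author, note))

    def timeline(self) -> list[Update]:
        """Updates in chronological order, regardless of insertion order."""
        return sorted(self.updates, key=lambda u: u.at)


def sort_incidents(incidents: list[Incident]) -> list[Incident]:
    """Most severe first; within a severity level, most recent first."""
    newest_first = sorted(incidents, key=lambda i: i.opened_at, reverse=True)
    # sorted() is stable, so the recency order survives within each severity.
    return sorted(newest_first, key=lambda i: i.severity)
```

Keeping updates as structured records rather than free-form chat messages is what makes it possible to rebuild the timeline later, even if notes were added out of order.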
Then take advantage of what’s learned to avoid similar issues in the future
Once an incident has been resolved, the next step is taking action to reduce the likelihood of the same issue occurring again, and making it easier to detect and resolve if it does. This is why documentation and postmortems are so important to incident management: if you can correlate a new incident to a past incident, you can figure out whether the problem you’re dealing with has already been solved. Proper documentation includes a list of follow-up tasks to address acute issues, firm plans to update alerts to reflect what you’ve learned, and a detailed, public postmortem document so everyone on your team, and within the organization, can more deeply understand the issue and identify similar issues that may exist elsewhere. This way, when a similar incident occurs in the future, your team has all the historical information they need in one place.
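One simple way to correlate a new incident with past ones, assuming each postmortem is tagged with a few keywords, is to compare tag overlap. The sketch below uses Jaccard similarity in Python; the incident IDs, tags, and the 0.5 threshold are made up for illustration, and real tooling may use much richer signals.

```python
def jaccard(a: set[str], b: set[str]) -> float:
    """Share of tags the two sets have in common (0.0 to 1.0)."""
    if not a and not b:
        return 0.0
    return len(a & b) / len(a | b)


def similar_past_incidents(
    new_tags: set[str],
    past: dict[str, set[str]],
    threshold: float = 0.5,   # assumption: tune to your own data
) -> list[str]:
    """Return IDs of past incidents at or above the threshold, best match first."""
    scored = [(jaccard(new_tags, tags), incident_id) for incident_id, tags in past.items()]
    return [incident_id for score, incident_id in sorted(scored, reverse=True) if score >= threshold]


# Hypothetical postmortem archive: incident ID -> tags.
past_incidents = {
    "INC-1": {"checkout", "timeout", "database"},
    "INC-2": {"search", "latency"},
}
print(similar_past_incidents({"checkout", "timeout"}, past_incidents))  # ['INC-1']
```

This only works if postmortems are tagged consistently, which is one more reason to settle on documentation conventions before the next incident.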
With effective incident management, you can focus on building
An incident management workflow that follows the principles described above will be more effective, more efficient, and easier on engineering teams. Most importantly, it saves time, so teams can focus on building new products and features rather than managing existing issues. If you’re not properly maintaining and remediating what you’ve already built, you won’t be able to build the new thing that takes your business to the next level. Better incident management is an important way of making this possible.