In early 2020, before we realized that the pandemic was surging under our noses and that we were mere months away from global lockdowns, I was sitting in a bar off of Brannan Street in San Francisco talking about building my new company, Kintaba, around modern incident management. At the time, this seemed to be a relatively obscure topic in Silicon Valley, historically housed within SRE, Engineering, and DevOps teams at large tech unicorns like Facebook and Google.
With me at the bar was a known figure in the emerging resilience engineering field, who himself had plenty of experience attempting to deploy modern incident tools and processes at startups and megacorps alike: processes born from the principles of crisis response at fire departments, the DoD, and the airline industry. I distinctly recall him noting, with a concerned look on his face, an obvious challenge: most people weren’t even sure what a “major incident” was in the first place.
Now, one year later, after 85.6 million cases and nearly 2 million deaths, the first batch of vaccines has arrived, and for the first time in what feels like a very long time there seems to be a light at the end of the tunnel that was 2020. I think it’s safe to say that our collective understanding of the phrase “major incident” has shifted from niche to universal.
In his book The Field Guide to Understanding Human Error, a handbook on the history of how we’ve learned to respond to major system failure and critical incidents in the airline industry, Sidney Dekker writes that one of the most challenging aspects of reflecting on and learning from an incident is simply that if you were not there yourself, then by definition you are on the outside and lack the necessary context to explain or learn from the event effectively.
Dekker notes that only those present can accurately reflect on the experience — something managers, accident investigators, and the public have found difficult to internalize, despite researchers understanding the phenomenon since at least the late 1940s.
But now, as we enter 2021, all of us have experienced the truly unexpected, black-swan major incident that was the global COVID pandemic, and I think we have a unique opportunity to seriously apply incident management philosophy to our experiences: not only to better understand how we responded as a community to this unprecedented event (and where we can improve next time), but also to learn more generally how we can better approach future crises in our lives, whether within our families or within our companies.
Lesson 1: It’s not your fault
In the mid-1950s, factory foremen started to notice that punishing people for making mistakes, even if that mistake cascaded into a major failure, did not actually reduce the recurrence of similar mistakes throughout the line.
Despite the overwhelming threat of punishment and fear of losing one’s livelihood, workers continued to make similar and repeated mistakes until eventually, clever owners began to realize that, in lieu of punishing a worker for their mistakes, they should have them write down their account of what happened and how. The worker was then instructed to distribute that writing to the whole company, sharing their unique insights into the nature of the mistake while also exposing the systems, processes, and context that caused them to fail in the first place.
As a result, not only did recurrence of the failure drop, but processes and practices changed as supervisors recognized that the root cause was systemic, not human.
The learning was critical to the birth of incident management: blaming people for failing when acting in good faith is a doomed path to resilience. Instead, we have to look at the system that allowed or even encouraged the person to fail.
Applied to the pandemic, this lesson means forgiving ourselves for actions that in hindsight were not ideal (maybe you didn’t wear a mask right away, or still sent your kids to schools that were open, or visited your parents), and also forgiving others who made best-effort decisions that turned out to be regrettable when more information became available. Instead of assigning blame, we need to ask how the system of processes and expectations allowed them to fail in the first place.
For example, in New York City (where I live), school administrators were locked into a battle with the governor, the mayor, and parents about whether or not to shut down. In the absence of executive orders or guidance from the governor (as happened in states like California), parents were left in a state of decision paralysis. Having clearer guidelines in place would have empowered more people to make better decisions. Which leads us to…
Lesson 2: Declare emergencies early and often
It can seem counter-intuitive, but the philosophy of incident management teaches us that even with incomplete information, it is important to declare an emergency early, and to empower more people with the ability to make that declaration.
On factory floors you’ll always see this principle on full display as “the big red button.” Anyone on the floor, at their own personal discretion, can push the button and halt all of the machinery if they feel something is unsafe or at risk of becoming so, regardless of their reasoning. There is no form to submit before pushing the button, and there is no chain of command to request approval from.
In the airline industry, the crucible of incident response where the stakes are some of the highest, companies even encourage and reward their mechanics and pilots for filing large numbers of seemingly lower-priority incident reports, as this has proven time and time again to reduce critical failures down the line, and thus to save lives by reducing crashes.
In the early days of the pandemic, Dr. Li Wenliang tried to raise the alarm in Wuhan when he first realized the virus posed a unique threat as he saw patient after patient begin to flood into his clinic, but rather than being empowered and trusted, he was quickly silenced by authorities who did not trust his on-the-ground observations. The big red button was, in this case, encased in the glass of procedure, bureaucracy, and global politics.
Even worse, after China did realize the risk and began to shut down aggressively, governments across the world by and large chose to ignore the threat and wait for it to escalate locally, rather than preemptively taking measures even as local experts warned of a rapidly increasing risk to the global population. The lack of early action ultimately caused the now infamous supply-chain failure when the virus was finally spreading too rapidly to ignore and entire countries’ populations rushed their local stores for critical goods.
Lesson 3: Everyone has a role to play
A common misconception is that it’s solely the engineering team’s responsibility to respond to major incidents and outages, just like we may think it’s the drug companies’ sole responsibility to respond to a pandemic. But the truth is that everyone in the company plays a role and is impacted by every incident, and because of that, everyone needs access to the most up-to-date information on the incident at all times.
At a corporation, that means everyone from your most junior sales hire to your C-Suite should have unobstructed visibility into the response process. The natural resistance to this approach is that these “outsiders” will cause thrash and communications overhead, but the reality is the exact opposite: by giving everyone visibility, the need to “go through channels” to request updates and express urgency vanishes, as the urgency and status are on full display at all times.
If we think about the pandemic, it should be obvious how critical open and clear information is for each of us — not only to make good decisions in our own day-to-day lives, but to reassure us that the core responding “teams” (our governments, health care workers, pharmaceutical manufacturers and the like) are working with the urgency we would ourselves hope to exert in their shoes.
Lesson 4: Take time to reflect
Perhaps the easiest piece to forget, but really the most important aspect of any successful incident response, actually happens after the incident is over. At tech unicorns like Facebook and Google, those involved in and impacted by major outages and failures write documents called “postmortems” to formally record the incident, its impact, the root cause, and critically, the follow-up steps to be taken to make certain that such an event never occurs again.
These documents are widely distributed within the team or company, allowing others to provide input and to ask questions. Crucially, they’re also written without pointing fingers or assigning blame, instead focusing on the systemic context that allowed the situation to occur and how those processes can be changed for the better with specific follow-up actions.
Teams then come together to reflect on and discuss each incident and its postmortem document, finally bringing closure to the event for the responders and the company as a whole, confident that their efforts will result in a more resilient environment moving forward.
After 2020, it’s more important than ever for each of us to find our own moment to reflect on our experience with the pandemic, and I think we can look to these healthy motions from incident management as a framework for closure. Kintaba’s name comes from the Japanese art form kintsugi, where broken pottery is mended with golden inlay such that the repaired object is more valuable than the original, and I think it’s a powerful metaphor to employ when we think about resilience in our companies and ourselves. Major incidents are ultimately positive opportunities to make our companies, our communities, and ourselves stronger than we were before.