What Incident Management teaches us about the pandemic

Published: January 13th, 2021

- John Egan

In early 2020, before we realized that the pandemic was surging under our noses and that we were mere months away from global lockdowns, I was sitting in a bar off of Brannan Street in San Francisco talking about building my new company, Kintaba, around modern incident management. At the time, this seemed to be a relatively obscure topic in Silicon Valley, historically housed within SRE, Engineering, and DevOps teams at large tech unicorns like Facebook and Google.

With me at the bar was a known figure in the emerging resilience engineering field, who himself had plenty of experience attempting to deploy modern incident tools and processes at startups and megacorps alike; processes born from the principles of crisis response at fire departments, the DoD, and the airline industry. I distinctly recall him noting, with a concerned look on his face, an obvious challenge: most people weren’t even sure what a “major incident” was in the first place.

Now, one year later, after 85.6 million cases and nearly 2 million deaths, the first batch of vaccines has arrived, and for the first time in what feels like a very long time there seems to be a light at the end of the tunnel that was 2020. I think it’s safe to say that our collective understanding of the phrase “major incident” has shifted from niche to universal.


In his book The Field Guide to Understanding Human Error, a handbook on the history of how we’ve learned to respond to major system failure and critical incidents in the airline industry, Sidney Dekker writes that one of the most challenging aspects of reflecting on and learning from an incident is simply that if you were not there yourself, then by definition you are on the outside and lack the necessary context to explain or learn from the event effectively.

Dekker notes that only those present can accurately reflect on the experience — something managers, accident investigators, and the public have found difficult to internalize, despite researchers understanding the phenomenon since at least the late 1940s.

But now, as we enter 2021, and ALL of us have experienced the truly unexpected, black-swan, major incident that was the global COVID pandemic… I think we have the unique opportunity to seriously apply incident management philosophy to our experiences. We can use it not only to better understand how we responded as a community to this unprecedented event (and where we can improve next time), but also, more generally, to work out how we can better approach future crises in our lives, whether within our families or within our companies.

Lesson 1: It’s not your fault
In the mid-1950s, factory foremen started to notice that punishing people for making mistakes, even if that mistake cascaded into a major failure, did not actually reduce the recurrence of similar mistakes throughout the line.

Despite the overwhelming threat of punishment and fear of losing one’s livelihood, workers continued to make similar and repeated mistakes until eventually, clever owners began to realize that, in lieu of punishing a worker for their mistakes, they should have them write down their account of what happened and how. The worker was then instructed to distribute that writing to the whole company, sharing their unique insights into the nature of the mistake while also exposing the systems, processes, and context that caused them to fail in the first place.

As a result, not only did recurrence of the failure drop, but processes and practices changed as supervisors recognized that the root cause was systemic, not human.

The learning was critical to the birth of incident management: blaming people for failing when acting in good faith is a doomed path to resilience. Instead, we have to look at the system that allowed or even encouraged the person to fail.

Applied to the pandemic, this lesson speaks not just to the importance of forgiving ourselves for actions that in hindsight were not ideal (maybe you didn’t wear a mask right away, or still sent your kids to schools that were open, or visited your parents), but also to forgiving others who made best-effort decisions that turned out to be regrettable when more information became available. Instead of assigning blame, we need to ask how the system of processes and expectations allowed them to fail in the first place.

For example, in New York City (where I live), school administrators were locked in a battle with the governor, the mayor, and parents about whether or not to shut down. In the absence of the kind of executive orders or guidance that governors issued in states like California, parents were left in a state of decision paralysis. Having clearer guidelines in place would have empowered more people to make better decisions. Which leads us to…

Lesson 2: Declare emergencies early and often
It can seem counter-intuitive, but the philosophy of incident management teaches us that even with incomplete information, it is important to declare an emergency early, and to empower more people with the ability to make that declaration.

On factory floors you’ll always see this principle on full display as “the big red button.” Anyone on the floor, at their own personal discretion, can push the button and halt all of the machinery if they feel something is unsafe or at risk of becoming so, regardless of their reasoning. There is no form to submit before pushing the button, and there is no chain of command to request approval from.

In the airline industry, the crucible of incident response where the stakes are some of the highest, companies even encourage and reward their mechanics and pilots for filing large numbers of seemingly lower priority incident reports, as it’s proven time and time again that this reduces critical failures down the line, and thus saves lives by reducing crashes.

In the early days of the pandemic, Dr. Li Wenliang tried to raise the alarm in Wuhan when he first realized the virus posed a unique threat as he saw patient after patient begin to flood into his clinic, but rather than being empowered and trusted, he was quickly silenced by authorities who did not trust his on-the-ground observations. The big red button was, in this case, encased in the glass of procedure, bureaucracy, and global politics.

Even worse, after China did realize the risk and began to shut down aggressively, governments across the world by and large chose to ignore the threat and wait for it to escalate locally, rather than preemptively taking measures even as local experts warned of a rapidly increasing risk to the global population. The lack of early action ultimately caused the now infamous supply-chain failure when the virus was finally spreading too rapidly to ignore and entire countries’ populations rushed their local stores for critical goods.

Lesson 3: Everyone has a role to play
A common misconception is that it’s solely the engineering team’s responsibility to respond to major incidents and outages, just like we may think it’s the drug companies’ sole responsibility to respond to a pandemic. But the truth is that everyone in the company plays a role and is impacted by every incident, and because of that, everyone needs access to the most up-to-date information on the incident at all times.

At a corporation, that means everyone from your most junior sales hire to your C-Suite should have unobstructed visibility into the response process. The natural resistance to this approach is that these “outsiders” will cause thrash and communications overhead, but the reality is the exact opposite: by giving everyone visibility, the need to “go through channels” to request updates and express urgency vanishes, as the urgency and status are on full display at all times.

If we think about the pandemic, it should be obvious how critical open and clear information is for each of us — not only to make good decisions in our own day-to-day lives, but to reassure us that the core responding “teams” (our governments, health care workers, pharmaceutical manufacturers and the like) are working with the urgency we would ourselves hope to exert in their shoes.

Lesson 4: Take time to reflect
Perhaps the easiest piece to forget, but really the most important aspect of any successful incident response, actually happens after the incident is over. At tech unicorns like Facebook and Google, those involved in and impacted by major outages and failures write documents called “postmortems” to formally record the incident, its impact, the root cause, and critically, the follow-up steps to be taken to make certain that such an event never occurs again.

These documents are widely distributed within the team or company, allowing others to provide input and to ask questions. Crucially, they’re also written without pointing fingers or assigning blame, instead focusing on the systemic context that allowed the situation to occur and how those processes can be changed for the better with specific follow-up actions.

Teams then come together to reflect on and discuss each incident and its postmortem document, bringing closure to the event for the responders and the company as a whole, confident that their efforts will result in a more resilient environment moving forward.

After 2020, it’s more important than ever for each of us to find our own moment to reflect on our experience with the pandemic, and I think we can look to these healthy motions from incident management as a framework for closure. Kintaba’s name comes from the Japanese art form kintsugi, where broken pottery is mended with golden inlay such that the repaired object is more valuable than the original, and I think it’s a powerful metaphor to employ when we think about resilience in our companies and ourselves. Major incidents are ultimately positive opportunities to make our companies, our communities, and ourselves stronger than we were before.

Article Tags

incident management

About John Egan

John Egan is co-founder and CEO of Kintaba, an incident response solution provider.


